Analysis and Prediction of Severity of Traffic Accident
`US Accident Traffic Severity Assessment`

## Table of Contents

1. [Data Preparation](#Data-Preparation "Goto Data Preparation Section")
   - [Importing Libraries](#Importing-Libraries "Goto Importing Libraries Sub-Section")
   - [Data Loading](#Data-Loading "Goto Data Loading Sub-Section")
   - [Data Summarization](#Data-Summarization "Goto Data Summarization Sub-Section")
   - [Data Cleaning](#Data-Cleaning "Goto Data Cleaning Sub-Section")
   - [Feature Engineering](#Feature-Engineering "Goto Feature Engineering Sub-Section")
   - [Outlier Treatment](#Outlier-Treatment "Goto Outlier Treatment Sub-Section")
   
   
2. [Exploratory Data Analysis](#Exploratory-Data-Analysis "Goto Exploratory Data Analysis Section")
   - [Importing Libraries for Exploratory Data Analysis](#Importing-Libraries-for-Exploratory-Data-Analysis "Goto Importing Libraries for Exploratory Data Analysis Sub-Section")
   - [Summary Statistics](#Summary-Statistics "Goto Summary Statistics Sub-Section")
   - [Univariate Analysis](#Univariate-Analysis "Goto Univariate Analysis Sub-Section")
   - [Bivariate Analysis](#Bivariate-Analysis "Goto Bivariate Analysis Sub-Section")
   - [Multivariate Analysis](#Multivariate-Analysis "Goto Multivariate Analysis Sub-Section")
   - [Miscellaneous Plots](#Miscellaneous-Plots "Goto Miscellaneous Plots Sub-Section")
   
   
3. [Data Pre-processing](#Data-Pre-processing "Goto Data Pre-processing Section")
   - [Importing Libraries for Data Pre-processing](#Importing-Libraries-for-Data-Pre-processing "Goto Importing Libraries Sub-Section")
   - [Selecting Columns](#Selecting-Columns "Goto Selecting Columns Sub-Section")
   - [Splitting State Specific Data](#Splitting-State-Specific-Data "Goto Splitting State Specific Data Sub-Section")
   
   
4. [Model Building and Evaluation](#Model-Building-and-Evaluation "Goto Model Building and Evaluation Section")
   - [Building ML Model for State 'CA'](#Building-ML-Model-for-State-'CA' "Goto Building ML Model for State 'CA' Sub-Section")
   - [Building ML Model for State 'TX'](#Building-ML-Model-for-State-'TX' "Goto Building ML Model for State 'TX' Sub-Section")
   - [Building ML Model for State 'FL'](#Building-ML-Model-for-State-'FL' "Goto Building ML Model for State 'FL' Sub-Section")
   - [Building ML Model for State 'SC'](#Building-ML-Model-for-State-'SC' "Goto Building ML Model for State 'SC' Sub-Section")
   - [Building ML Model for State 'NC'](#Building-ML-Model-for-State-'NC' "Goto Building ML Model for State 'NC' Sub-Section")
   
   
5. [Combined Results of Models on Datasets of States](#Combined-Results-of-Models-on-Datasets-of-States "Goto Combined Results of Models on Datasets of States Section")

## Data Preparation

### Importing Libraries

In [None]:
!pip install --upgrade pip
!pip uninstall -y numpy
!pip install numpy==1.18.2
!pip uninstall -y pandas
!pip install pandas==1.0.3
!pip uninstall -y matplotlib
!pip install matplotlib==3.2.1
!pip uninstall -y wordcloud
!pip install wordcloud==1.6.0
!pip uninstall -y swifter
!pip install swifter==0.301
!pip uninstall -y seaborn
!pip install seaborn==0.10.0
!pip uninstall -y plotly
!pip install plotly==4.5.4
!pip uninstall -y tensorflow
!pip install tensorflow==2.0.0

In [None]:
# Hiding all warnings
import warnings
warnings.filterwarnings('ignore')

# import numpy, pandas and other necessary libraries
import re
import numpy as np
import pandas as pd
import swifter
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
# Optimizing settings and configuration

pd.options.display.max_columns = 50

### Data Loading

In [None]:
# Reading the dataset

data = pd.read_csv('../input/us-accidents/US_Accidents_June20.csv')

In [None]:
# Displaying initial data

data.head(5)

In [None]:
# Understanding the dataset | Meta Data

data.info()

In [None]:
# Understanding the dataset | Data Content

data.describe(include='all')

### Data Summarization

In [None]:
# Displaying the summary of the dataset

print('Rows     :',data.shape[0])
print('Columns  :',data.shape[1])

In [None]:
# Displaying the attributes in the dataset

print('\nAttributes:\n',data.columns.tolist())

In [None]:
# Displaying the percentage of missing values per attribute

print('Percentage of Missing values:\n\n',(100*data.isnull().sum()/data.shape[0]).round(2))

In [None]:
#Displaying the number of unique values per attribute

print('Unique values per column :',data.nunique())

In [None]:
# Displaying the names of the attributes that contain numerical values

data.select_dtypes(include=['int','float']).columns

In [None]:
# Displaying the names of the attributes that contain non-numerical values

data.select_dtypes(exclude=['int','float']).columns

### Data Cleaning

#### Addressing Dataset Completeness

In [None]:
# Displaying the list of attributes with high amount of missing values (>20%)

print('Attributes with > 20% missing values: ', data.columns[(100*data.isnull().sum()/data.shape[0]).round(2)>20].tolist())

Inspecting these attributes revealed the following:
1. 'TMC' is an important attribute communicated by the authorities, so we keep it.
2. 'End_Lat' and 'End_Lng' are missing when the stretch of road affected by the accident is very short; the start and end locations are then almost identical, so the end location can be removed.
3. 'Number', 'Wind_Chill(F)' and 'Precipitation(in)' can be removed due to their high percentage of missing values.
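
The observations above can be checked with a few lines of pandas. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show how the missing-value percentage, the 20% threshold, and the relation between a missing end location and the affected distance might be inspected.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data (values are made up)
toy = pd.DataFrame({
    'Distance(mi)': [0.0, 0.01, 1.5, 0.0, 2.3],
    'End_Lat': [np.nan, np.nan, 34.1, np.nan, 36.0],
    'Number': [np.nan, np.nan, np.nan, 12.0, np.nan],
})

# Percentage of missing values per column
missing_pct = (100 * toy.isnull().mean()).round(2)

# Columns whose missing percentage exceeds the 20% threshold
high_missing = missing_pct[missing_pct > 20].index.tolist()

# Mean affected distance for rows without and with an end location
mean_dist_missing_end = toy.loc[toy['End_Lat'].isnull(), 'Distance(mi)'].mean()
mean_dist_with_end = toy.loc[toy['End_Lat'].notnull(), 'Distance(mi)'].mean()

print(high_missing)
print(mean_dist_missing_end, mean_dist_with_end)
```

On the real data the same comparison would use `data` in place of `toy`.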

In [None]:
# Imputing the missing values in TMC with 201, since that code means a generic 'Accident' and we are not sure of further details.

data['TMC'].fillna(value=201, inplace=True)

In [None]:
# Removing the attributes from the dataset

data.drop(columns=['End_Lat', 'End_Lng', 'Number', 'Wind_Chill(F)', 'Precipitation(in)'], inplace=True)

Inspecting the remaining attributes with missing values revealed the following:
1. 'Wind_Speed(mph)' is missing when 'Wind_Direction' is calm, so treating it as 0 is fair.
2. The other attributes are missing at random, and such records make up less than 5% of the data, so those records are removed.
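
A minimal sketch of how the first point could be verified and applied, assuming a toy frame with made-up values: it measures the share of missing wind speeds per wind direction, then fills in 0 only for rows that are both calm and missing.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'Wind_Direction': ['Calm', 'N', 'Calm', 'SW', 'Calm'],
    'Wind_Speed(mph)': [np.nan, 5.8, np.nan, 10.4, 0.0],
})

# Share of missing wind speed values within each wind direction
missing_by_direction = toy['Wind_Speed(mph)'].isnull().groupby(toy['Wind_Direction']).mean()

# Impute 0 only where the direction is calm and the speed is missing
calm_and_missing = (toy['Wind_Direction'] == 'Calm') & toy['Wind_Speed(mph)'].isnull()
toy.loc[calm_and_missing, 'Wind_Speed(mph)'] = 0

print(missing_by_direction)
```

Restricting the assignment to missing values leaves any recorded speed untouched, which is a slightly more conservative choice than overwriting every calm row.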

In [None]:
# Checking if we still have missing values per attribute in our dataset
# Displaying the list of attributes with any amount of missing values (>0%)

print('Attributes with missing values: ', data.columns[(100*data.isnull().sum()/data.shape[0])>0].tolist())

In [None]:
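# Setting 'Wind_Speed(mph)' to 0 for rows where 'Wind_Direction' is calm, meaning almost no wind.
# The raw data spells this value both 'Calm' and 'CALM' (the spelling is unified later), so both are matched here.

data.loc[data['Wind_Direction'].isin(['Calm', 'CALM']), 'Wind_Speed(mph)'] = 0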

In [None]:
# Checking if we still have missing values in the attribute 'Wind_Speed(mph)'

print('Percentage of Missing values in attribute Wind_Speed(mph) is: ', (100*data['Wind_Speed(mph)'].isnull().sum()/data.shape[0]).round(2))

In [None]:
# Checking what percentage of rows will remain after dropping rows with missing values

print(f'Percentage of rows remaining after removal of rows containing missing values: {(100*data.dropna().shape[0]/data.shape[0]):.4}')
print(f'Percentage of rows deleted in order to remove missing values: {100-(100*data.dropna().shape[0]/data.shape[0]):.4}')

In [None]:
# Dropping all rows with missing values, since fewer than 30% of the records get deleted and we have a huge dataset

data.dropna(inplace=True)

In [None]:
# Checking if we still have missing values per attribute in our dataset
# Displaying the list of attributes with any amount of missing values (>0%)

print('Attributes with Missing values: ', data.columns[(100*data.isnull().sum()/data.shape[0])>0].tolist())

#### Addressing Dataset Validity

In [None]:
# Checking for datatypes of the attributes

data.dtypes

In [None]:
# Converting datatype for attributes related to datetime.

data["Start_Time"]= pd.to_datetime(data["Start_Time"]) 
data["End_Time"]= pd.to_datetime(data["End_Time"])
data["Weather_Timestamp"]= pd.to_datetime(data["Weather_Timestamp"])

#### Addressing Dataset Consistency

In [None]:
# Checking the percentage of Duplicate records in the dataset

print(f'Percentage of duplicate records: {100-(100*data.drop_duplicates().shape[0]/data.shape[0])}')

In [None]:
# Dropping the duplicate records, if any

data.drop_duplicates(inplace=True)

In [None]:
# Adding consistency to the various wind direction values

data['Wind_Direction'].replace({'North': 'N', 'East': 'E', 'West': 'W', 'South': 'S',
                                'VAR': 'Variable', 'CALM': 'Calm'}, inplace=True)

data['Wind_Direction'].unique()

#### Addressing Dataset Accuracy

In [None]:
# 'End_Time' should never be earlier than 'Start_Time'; dropping the records that violate this

data.drop(data[data['End_Time']<data['Start_Time']].index, inplace=True)

#### Dropping unnecessary attributes

In [None]:
# Since the entire dataset is of one country, we can remove the attribute 'Country' as it contains only one value

data.drop(columns=['Country'], inplace=True)

In [None]:
# Since the entire dataset contains single value for 'Turning_Loop' we can remove the attribute 'Turning_Loop'

data.drop(columns=['Turning_Loop'], inplace=True)

In [None]:
# Since the ID column is an identifier, we can remove the attribute 'ID'

data.drop(columns=['ID'], inplace=True)

### Feature Engineering

#### Deriving the attribute 'Time_Duration(min)'

In [None]:
# Deriving the attribute 'Time_Duration(min)'

data.insert(4,'Time_Duration(min)',(data['End_Time']-data['Start_Time'])//np.timedelta64(1,'m'))

In [None]:
data['Time_Duration(min)'].describe()
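
As a small illustration of the derivation above, the sketch below computes whole-minute durations on two made-up timestamps. It uses `pd.Timedelta(minutes=1)` as the divisor, which is assumed here to be interchangeable with the `np.timedelta64(1,'m')` used in the notebook.

```python
import pandas as pd

# Two made-up accidents, one of them crossing midnight
toy = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2020-01-01 08:00:00', '2020-01-01 23:50:30']),
    'End_Time': pd.to_datetime(['2020-01-01 08:29:59', '2020-01-02 00:20:30']),
})

# Whole minutes between start and end; floor division drops leftover seconds
toy['Time_Duration(min)'] = (toy['End_Time'] - toy['Start_Time']) // pd.Timedelta(minutes=1)

print(toy['Time_Duration(min)'].tolist())
```

Because of the floor division, a duration of 29 minutes and 59 seconds is stored as 29 minutes.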

In [None]:
# Dropping the attribute 'End_Time' since it is redundant now

data.drop(columns=['End_Time'], inplace=True)

#### Splitting the 'Start_Time' timestamp into 'Year', 'Month', 'Day', 'Hour', 'Minute' and 'Weekday'

In [None]:
# Breaking 'Start_Time' into 'Year', 'Month', 'Day', 'Hour', 'Minute' and 'Weekday'

data['Year']=data['Start_Time'].dt.year
data['Month']=data['Start_Time'].dt.month
data['Day']=data['Start_Time'].dt.day
data['Hour']=data['Start_Time'].dt.hour
data['Minute']=data['Start_Time'].dt.minute
data['Weekday']=data['Start_Time'].dt.weekday

# Mapping the weekday number (0 = Monday) to its name
weekday_names = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
data['Weekday']=data['Weekday'].map(weekday_names)


In [None]:
# Dropping the attribute 'Start_Time' since it is redundant now

data.drop(columns=['Start_Time'], inplace=True)

#### Extracting 'Keyword' from 'Description' attribute

In [None]:
# Extracting keywords from the attribute 'Description' using NLP, following the suggestions from this article:
# https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34

def clean(text):
    # lowercase
    text=text.lower()
    #remove tags
    text=re.sub("</?.*?>"," <> ",text)
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    return text

data['Description'] = data['Description'].swifter.apply(lambda x:clean(x))

# removing stopwords
data['Description'] = data['Description'].swifter.apply(lambda x: ' '.join([item for item in x.split(' ') if item not in STOPWORDS]))

#show the starting few 'Descriptions'
data['Description'][:10]

In [None]:
# Getting the Description (text) column 
docs=data['Description'].tolist()

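# Creating a vocabulary of words, ignoring words that appear in more than 85% of documents and eliminating stop words
# (stop_words is passed as a list, since newer scikit-learn versions do not accept a set)
cv=CountVectorizer(max_df=0.85,stop_words=list(STOPWORDS))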
word_count_vector=cv.fit_transform(docs)

# Displaying Shape
word_count_vector.shape

In [None]:
# Generating TFIDF Transformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)
tfidf_transformer.idf_

In [None]:
# Sorting the feature name based on score

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)[:3]

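# Joining the feature names of the sorted items into one ';'-separated string (sort_coo keeps only the top 3, despite the 'top5' in the name)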
def extract_top5_from_vector(feature_names, sorted_items):
    keyword = []

    for idx, score in sorted_items:
        keyword.append(feature_names[idx])

    return ';'.join(keyword)

In [None]:
# Getting actual feature names

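# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn versions, so use whichever is available
feature_names=cv.get_feature_names_out() if hasattr(cv, 'get_feature_names_out') else cv.get_feature_names()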

In [None]:
# Extracting the 'Keywords' from 'Description' attribute of dataset

def extract_description_keywords(doc):
    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))
    #sort the tf-idf vectors by descending order of scores and get features
    sorted_items=sort_coo(tf_idf_vector.tocoo(copy=False))
    return extract_top5_from_vector(feature_names,sorted_items)

data['Keywords'] = data['Description'].swifter.apply(lambda x:extract_description_keywords(x))
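
To make the ranking step more concrete, here is a hand-rolled sketch of TF-IDF keyword selection on three made-up descriptions. It is not the scikit-learn implementation: it only reuses the smoothed idf formula documented for `TfidfTransformer(smooth_idf=True)`, skips the L2 normalisation (which does not change the ranking within a single document), and breaks ties alphabetically, whereas `sort_coo` above breaks ties by column index. The helper name `top_keywords` is hypothetical.

```python
import math
from collections import Counter

# Made-up, already cleaned descriptions
docs = [
    "accident on main street near exit",
    "lane blocked due to accident on highway",
    "slow traffic on highway ramp",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}

def top_keywords(tokens, k=3):
    """Return the k highest-scoring terms of one document, joined by ';'."""
    tf = Counter(tokens)
    scores = {term: tf[term] * idf[term] for term in tf}
    # Highest score first; ties are broken alphabetically for determinism
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ';'.join(ranked[:k])

keywords = [top_keywords(tokens) for tokens in tokenized]
print(keywords)
```

Terms that occur in every description, such as "on", receive the lowest idf and therefore never make it into the top three here.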

In [None]:
# Adding an attribute per keyword to be used for modelling

data['Keyword_1'] = data['Keywords'].swifter.apply(lambda x:x.split(';')[0] or None)  # an empty string (no keyword found) becomes None
data['Keyword_2'] = data['Keywords'].swifter.apply(lambda x:str(x.split(';')[1]) if len(x.split(';'))>1 else None)
data['Keyword_3'] = data['Keywords'].swifter.apply(lambda x:str(x.split(';')[2]) if len(x.split(';'))>2 else None)

In [None]:
# Displaying the percentage of missing values per attribute

print('Percentage of Missing values:\n\n',(100*data.isnull().sum()/data.shape[0]).round(2))

In [None]:
# Removing records with missing values, i.e. missing 'Keywords' since they are only around 2.5%

data.dropna(inplace=True)

In [None]:
# Removing Redundant Column; 'Description' and 'Keywords'

data.drop(columns=['Description', 'Keywords'], inplace=True)

### Outlier Treatment

In [None]:
# Removing outliers | Time_Duration
# Removing records stating duration more than 12 days (since higher than 12 days is not recorded yet)

data.drop(data[data['Time_Duration(min)'] > (12*1440)].index, inplace=True)

In [None]:
# Removing outliers | Wind_Speed(mph)
# Removing records wind speed more than 260 mph (since higher than ~253mph is not recorded yet)

data.drop(data[data['Wind_Speed(mph)'] > 260].index, inplace=True)

In [None]:
# Removing outliers | Distance(mi)
# Removing records distance(mi) more than 109 miles (since higher than ~109 miles is not recorded yet)

data.drop(data[data['Distance(mi)'] > 109].index, inplace=True)

In [None]:
# Removing outliers | Temperature(F)
# Removing records with temperature(F) above 134.1 F (since higher than ~134.1 F is not recorded yet)

data.drop(data[data['Temperature(F)'] > 134.1].index, inplace=True)

In [None]:
# Removing outliers | Pressure(in)
# Removing records pressure(in) less than 25.69 (since lesser than ~25.69 is not recorded yet)
# Removing records pressure(in) more than 32.03 (since higher than ~32.03 is not recorded yet)

data.drop(data[data['Pressure(in)'] < 25.69].index, inplace=True)
data.drop(data[data['Pressure(in)'] > 32.03].index, inplace=True)

In [None]:
# Removing outliers | Visibility(mi)
# Removing records visibility(mi) more than 150 miles (since higher than ~150 miles is not recorded yet)

data.drop(data[data['Visibility(mi)'] > 150].index, inplace=True)
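
The per-attribute filters in this section all follow the same pattern, so they could also be expressed as one table of bounds. The sketch below is an illustration on made-up rows, reusing three of the thresholds from above; the `bounds` dictionary and the `keep` mask are hypothetical names.

```python
import pandas as pd

# Made-up rows: one valid, three each violating a different bound
toy = pd.DataFrame({
    'Temperature(F)': [70.0, 150.0, 65.0, 80.0],
    'Pressure(in)':   [29.9, 30.1, 20.0, 30.0],
    'Visibility(mi)': [10.0, 10.0, 10.0, 200.0],
})

# (lower, upper) bounds per attribute; None means no bound on that side
bounds = {
    'Temperature(F)': (None, 134.1),
    'Pressure(in)': (25.69, 32.03),
    'Visibility(mi)': (None, 150),
}

# Start by keeping every row, then rule out rows outside any bound
keep = pd.Series(True, index=toy.index)
for column, (lower, upper) in bounds.items():
    if lower is not None:
        keep &= toy[column] >= lower
    if upper is not None:
        keep &= toy[column] <= upper

filtered = toy[keep]
print(filtered)
```

Unlike the `drop` calls above, this mask would also discard rows with missing values in the checked columns, since comparisons with NaN evaluate to False; that makes no difference once missing values have already been removed.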

In [None]:
# Saving to file only if the dataset contains no missing values

if not data.isnull().values.any():
    data.to_csv("/kaggle/working/data.csv", index=False)
    print('Data Saved')
else:
    print('Data Not Saved')

___

## Exploratory Data Analysis

### Importing Libraries for Exploratory Data Analysis

In [None]:
%reset -f

# Hiding all warnings
import warnings
warnings.filterwarnings('ignore')

# import numpy and pandas
import pandas as pd
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# import for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# do an inline so that plt.show() is not required every time
%matplotlib inline

In [None]:
# Reading the dataset

data = pd.read_csv('/kaggle/working/data.csv').dropna()

### Summary Statistics

In [None]:
# Displaying the summary of the dataset

print('Rows     :',data.shape[0])
print('Columns  :',data.shape[1])

In [None]:
# Displaying all attributes of the dataset

data.columns

In [None]:
data.dtypes

In [None]:
data.describe(exclude=[object]).T

In [None]:
data.describe(include=[object]).T

In [None]:
# Analysing the 'Severity' attribute

#Textual Representation
print(data.groupby(by='Severity').size())

### Univariate Analysis

In [None]:
# Analysing the 'TMC' attribute

#Pictorial Representation
fig = plt.figure(figsize = (12, 4))
sns.countplot(y="TMC", data=data, order=data['TMC'].value_counts().index[:10], palette='Blues_d')
plt.show()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(16, 12))

# Analysing the 'Year' attribute
sns.distplot(data['Year'], ax=axes[0, 0])

# Analysing the 'Month' attribute
sns.distplot(data['Month'], ax=axes[0, 1])

# Analysing the 'Day' attribute
sns.distplot(data['Day'], ax=axes[1, 0])

# Analysing the 'Weekday' attribute
sns.countplot(data['Weekday'], palette='Blues_d',ax=axes[1, 1])

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(1, 2, figsize=(16, 6))

# Analysing the 'Start_Time' Hour, Minute attribute
sns.distplot(data['Hour'], bins=24, ax=axes[0])
sns.distplot(data['Minute'], bins=60, ax=axes[1])

In [None]:
# Analysing the 'Timestamp' (Year, Month) attribute

fig = plt.figure(figsize = (16, 4))
data.groupby(by=['Year', 'Month']).size().plot()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(1, 2, figsize=(16, 6))

# Analysing the 'Time_Duration(min)' attribute
sns.distplot(data['Time_Duration(min)']/60, ax=axes[0])
sns.boxplot(data['Time_Duration(min)'], ax=axes[1])

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(1, 2, figsize=(16, 6))

# Analysing the 'Distance(mi)' attribute
sns.distplot(data['Distance(mi)'], kde=False, ax=axes[0])
sns.boxplot(data['Distance(mi)'], ax=axes[1])

In [None]:
# Analysing the 'Keyword_1' attribute (derived from 'Description')

text = ' '.join(data['Keyword_1'].to_list())
wordcloud = WordCloud(width = 400, height = 400, background_color = 'white', stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (5, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Analysing the 'Keyword_2' attribute (derived from 'Description')

text = ' '.join(data['Keyword_2'].to_list())
wordcloud = WordCloud(width = 400, height = 400, background_color = 'white', stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (5, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Analysing the 'Keyword_3' attribute (derived from 'Description')

text = ' '.join(data['Keyword_3'].to_list())
wordcloud = WordCloud(width = 400, height = 400, background_color = 'white', stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (5, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Analysing the 'State' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(x='State', data=data, order=data['State'].value_counts().index, palette='Blues_d')
plt.show()

In [None]:
data.groupby(by='State').size().sort_values().plot.pie(autopct='%1.1f%%', shadow=True, figsize=(16, 16))

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(16, 12))

# Analysing the 'Temperature(F)', 'Humidity(%)', 'Pressure(in)' and 'Wind_Speed(mph)' attributes
sns.distplot(data['Temperature(F)'], ax=axes[0, 0])
sns.distplot(data['Humidity(%)'], ax=axes[0, 1])
sns.distplot(data['Pressure(in)'], ax=axes[1, 0])
sns.distplot(data['Wind_Speed(mph)'], ax=axes[1, 1])

In [None]:
# Analysing the 'Visibility(mi)' attribute

fig = plt.figure(figsize = (16, 6))
sns.distplot(data['Visibility(mi)'])
plt.show()

In [None]:
# Analysing the 'Wind_Direction' attribute

fig = plt.figure(figsize = (12, 6))
sns.countplot(y='Wind_Direction', data=data, order=data['Wind_Direction'].value_counts().index, palette='Blues_d')
plt.show()

In [None]:
# Analysing the 'Weather_Condition' attribute

fig = plt.figure(figsize = (12, 6))
sns.countplot(y='Weather_Condition', data=data, order=data['Weather_Condition'].value_counts()[:20].index, palette='Blues_d')
plt.show()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(3, 4, figsize=(20, 16))

# Analysing the 'Amenity' attribute
sns.countplot(x='Amenity', data=data, ax=axes[0, 0], palette='Blues_d')

# Analysing the 'Bump' attribute
sns.countplot(x='Bump', data=data, ax=axes[0, 1], palette='Blues_d')

# Analysing the 'Crossing' attribute
sns.countplot(x='Crossing', data=data, ax=axes[0, 2], palette='Blues_d')

# Analysing the 'Give_Way' attribute
sns.countplot(x='Give_Way', data=data, ax=axes[0, 3], palette='Blues_d')

# Analysing the 'Junction' attribute
sns.countplot(x='Junction', data=data, ax=axes[1, 0], palette='Blues_d')

# Analysing the 'No_Exit' attribute
sns.countplot(x='No_Exit', data=data, ax=axes[1, 1], palette='Blues_d')

# Analysing the 'Railway' attribute
sns.countplot(x='Railway', data=data, ax=axes[1, 2], palette='Blues_d')

# Analysing the 'Roundabout' attribute
sns.countplot(x='Roundabout', data=data, ax=axes[1, 3], palette='Blues_d')

# Analysing the 'Station' attribute
sns.countplot(x='Station', data=data, ax=axes[2, 0], palette='Blues_d')

# Analysing the 'Stop' attribute
sns.countplot(x='Stop', data=data, ax=axes[2, 1], palette='Blues_d')

# Analysing the 'Traffic_Calming' attribute
sns.countplot(x='Traffic_Calming', data=data, ax=axes[2, 2], palette='Blues_d')

# Analysing the 'Traffic_Signal' attribute
sns.countplot(x='Traffic_Signal', data=data, ax=axes[2, 3], palette='Blues_d')

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(8, 8))

# Analysing the 'Sunrise_Sunset' attribute
sns.countplot(x='Sunrise_Sunset', data=data, ax=axes[0, 0], palette='Blues_d')

# Analysing the 'Civil_Twilight' attribute
sns.countplot(x='Civil_Twilight', data=data, ax=axes[0, 1], palette='Blues_d')

# Analysing the 'Nautical_Twilight' attribute
sns.countplot(x='Nautical_Twilight', data=data, ax=axes[1, 0], palette='Blues_d')

# Analysing the 'Astronomical_Twilight' attribute
sns.countplot(x='Astronomical_Twilight', data=data, ax=axes[1, 1], palette='Blues_d')

### Bivariate Analysis

In [None]:
# Analysing the 'TMC' & 'Severity' attribute

temp = pd.DataFrame(data.groupby(by='TMC').size())
temp = temp.sort_values(by=0, ascending=False).index[:5]

#Pictorial Representation

fig = plt.figure(figsize = (16, 6))
sns.countplot(y="TMC", data=data, order=temp, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Analysing the 'Year' & 'Severity' attribute

fig = plt.figure(figsize = (16, 4))
sns.countplot(y="Year", data=data, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Analysing the 'Month' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(y="Month", data=data, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Analysing the 'Weekday' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(y="Weekday", data=data, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(1, 2, figsize=(16, 6))

# Analysing the impact of 'Time_Duration(min)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Time_Duration(min)', y='Severity', ax=axes[0])

# Analysing the impact of 'Distance(mi)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Distance(mi)', y='Severity', ax=axes[1])

In [None]:
# Analysing the 'State' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(x="State", data=data, order=data['State'].value_counts().index, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(16, 12))

# Analysing the impact of 'Temperature(F)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Temperature(F)', y='Severity', ax=axes[0, 0])

# Analysing the impact of 'Humidity(%)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Humidity(%)', y='Severity', ax=axes[0, 1])

# Analysing the impact of 'Pressure(in)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Pressure(in)', y='Severity', ax=axes[1, 0])

# Analysing the impact of 'Wind_Speed(mph)' attribute on 'Severity' attribute | Scatter Plot
data.plot.scatter(x='Wind_Speed(mph)', y='Severity', ax=axes[1, 1])

In [None]:
# Analysing the impact of 'Visibility(mi)' attribute on 'Severity' attribute | Scatter Plot

fig = plt.figure(figsize = (8, 6))
data.plot.scatter(x='Visibility(mi)', y='Severity')
plt.show()

In [None]:
# Analysing the 'Side' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(x="Side", data=data, order=data['Side'].value_counts().index, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Analysing the 'Wind_Direction' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(x="Wind_Direction", data=data, order=data['Wind_Direction'].value_counts().index, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Analysing the 'Weather_Condition' & 'Severity' attribute

fig = plt.figure(figsize = (16, 6))
sns.countplot(x="Weather_Condition", data=data, order=data['Weather_Condition'].value_counts()[:12].index, hue='Severity', palette='Blues_d')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(16, 6))

# Analysing the impact of 'Sunrise_Sunset' attribute on 'Severity' attribute | Box Plot
sns.boxplot(x='Sunrise_Sunset', y='Severity', data=data, ax=axes[0, 0], order=data['Sunrise_Sunset'].value_counts()[:10].index)

# Analysing the impact of 'Civil_Twilight' attribute on 'Severity' attribute | Box Plot
sns.boxplot(x='Civil_Twilight', y='Severity', data=data, ax=axes[0, 1], order=data['Civil_Twilight'].value_counts()[:10].index)

# Analysing the impact of 'Nautical_Twilight' attribute on 'Severity' attribute | Box Plot
sns.boxplot(x='Nautical_Twilight', y='Severity', data=data, ax=axes[1, 0], order=data['Nautical_Twilight'].value_counts()[:10].index)

# Analysing the impact of 'Astronomical_Twilight' attribute on 'Severity' attribute | Box Plot
sns.boxplot(x='Astronomical_Twilight', y='Severity', data=data, ax=axes[1, 1], order=data['Astronomical_Twilight'].value_counts()[:10].index)

### Multivariate Analysis

In [None]:
# plotting correlations on a heatmap

plt.figure(figsize=(16,8))
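# Restricting to numeric and boolean columns, since newer pandas versions raise an error on text columns in corr()
sns.heatmap(data.select_dtypes(include=['number', 'bool']).corr(), cmap="YlGnBu", annot=False)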
plt.show()

### Miscellaneous Plots

In [None]:
BBox = ((data.Start_Lng.min(), data.Start_Lng.max(), data.Start_Lat.min(), data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('map/map_pic.png')

In [None]:
fig, ax = plt.subplots(figsize = (26,14))
ax.scatter(data[data['Severity']==1].Start_Lng+0.3, data[data['Severity']==1].Start_Lat-0.8, zorder=1, alpha= 0.7, c='blue', s=4)
ax.scatter(data[data['Severity']==2].Start_Lng+0.3, data[data['Severity']==2].Start_Lat-0.8, zorder=1, alpha= 0.7, c='green', s=3)
ax.scatter(data[data['Severity']==3].Start_Lng+0.3, data[data['Severity']==3].Start_Lat-0.8, zorder=1, alpha= 0.7, c='orange', s=2)
ax.scatter(data[data['Severity']==4].Start_Lng+0.3, data[data['Severity']==4].Start_Lat-0.8, zorder=1, alpha= 0.7, c='red', s=1)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto')

___

## Data Pre-processing

### Importing Libraries for Data Pre-processing

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

### Selecting Columns

In [None]:
# Selecting the features which are likely to be available initially upon the accident, plus target variable 'Severity' for model building.

# Reading the above processed data from disk
data= pd.read_csv("/kaggle/working/data.csv").dropna()

# List of all available attributes
data.columns

In [None]:
# List of attributes selected because they are quickly available after an accident, considering:
# 1. the objective of the research
# 2. the non-repetition of information (e.g. Complete Address + Zipcode)

cols = [
        'Source', 'TMC', 'Severity', 'Start_Lat', 'Start_Lng', 'Temperature(F)',
        'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition',
        'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway','Roundabout', 'Station', 
        'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Sunrise_Sunset', 'Civil_Twilight', 
        'Year', 'Month', 'Day', 'Hour', 'Minute','Weekday', 
        'Street', 'Side', 'City', 'County', 'Keyword_1', 'Keyword_2', 'Keyword_3'
]

### Splitting State Specific Data

In [None]:
# Generating one dataset per state, restricted to the selected columns
# Saving the top 5 state specific datasets to disk, so that RAM can be freed up before building the models

for state in ['CA', 'TX', 'FL', 'SC', 'NC']:
    data[data['State']==state][cols].to_csv(f"/kaggle/working/data_{state}.csv", index=False)

___

## Model Building and Evaluation

### Building ML Model for State 'CA'

#### Importing Libraries

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

# import for pre-processing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# import for visualization
import matplotlib.pyplot as plt

# import for model building
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# import for Neural Network based model building
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Building State specific model | State 'CA'

result = {}
state='CA'
result['State']=state
processed_data = pd.read_csv(f'/kaggle/working/data_{state}.csv').dropna()
cols = processed_data.select_dtypes(include='object').columns

In [None]:
# Class Balancing | Using Up Sampling

# Separate majority and minority classes
df_s1 = processed_data[processed_data['Severity']==1]
df_s2 = processed_data[processed_data['Severity']==2]
df_s3 = processed_data[processed_data['Severity']==3]
df_s4 = processed_data[processed_data['Severity']==4]

# Size of the largest class
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4))

# Upsample every class to the size of the largest class (sampling with replacement only where a class is smaller)
df_s1 = resample(df_s1, replace=len(df_s1)<count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2)<count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3)<count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4)<count, n_samples=count, random_state=42)
 
# Combine majority class with upsampled minority class
processed_data = pd.concat([df_s1, df_s2, df_s3, df_s4])
 
# Display new class counts
processed_data.groupby(by='Severity')['Severity'].count()
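
The class-balancing step above can also be written generically, without one variable per severity level. The following is a minimal sketch that assumes only NumPy; `upsample_indices` is a hypothetical helper (it is not part of this notebook) that returns row positions which could be passed to `processed_data.iloc[...]`. Unlike `resample(..., replace=True)`, which draws every row with replacement, this variant keeps all original rows and only draws the shortfall.

```python
import numpy as np

def upsample_indices(labels, seed=42):
    """Return row positions that give every class as many rows as the largest class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, n in zip(classes, counts):
        positions = np.flatnonzero(labels == cls)
        # keep all original rows, then draw the shortfall with replacement
        extra = rng.choice(positions, size=target - n, replace=True)
        picked.append(np.concatenate([positions, extra]))
    return np.concatenate(picked)

# toy labels (made up): severity 2 dominates, severities 1 and 4 are rare
toy_severity = np.array([2, 2, 2, 2, 3, 3, 1, 4])
idx = upsample_indices(toy_severity)
balanced = toy_severity[idx]
```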

In [None]:
# Set the target for the prediction
target='Severity' 

# set X and y
y = processed_data[target]
X = processed_data.drop(target, axis=1)

# Create the encoder.
encoder = OrdinalEncoder()
X[cols] = encoder.fit_transform(X[cols])

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the data set into training and validation data sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Scaling the features of the Train, Validation and Test Datasets
scaler = StandardScaler()

# Fit the scaler on the Train Dataset only, then scale the Train Dataset
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Scaling Validation Dataset with the statistics learned from the Train Dataset
X_val = scaler.transform(X_val)

# Scaling Test Dataset with the statistics learned from the Train Dataset
X_test = scaler.transform(X_test)
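
A common convention is to estimate the scaling statistics on the training split only and reuse them for the validation and test splits, so that all three splits are expressed in the same feature space. The following is a minimal NumPy sketch of that idea; the arrays are made-up illustrative values, not rows from the accident data.

```python
import numpy as np

# illustrative feature matrices (rows = accidents, columns = features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std
```

The second test feature lands well above zero after scaling, which is the intended signal: relative to the training data, that value is unusually large.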

#### VISUALIZING THE DATA ON MAP

In [None]:
BBox = ((processed_data.Start_Lng.min(), processed_data.Start_Lng.max(), processed_data.Start_Lat.min(), processed_data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('/kaggle/working/map/map_pic_ca.png')

In [None]:
fig, ax = plt.subplots(figsize = (27,33))
ax.scatter(processed_data[processed_data['Severity']==1].Start_Lng, processed_data[processed_data['Severity']==1].Start_Lat-.1, zorder=1, c='b', s=4)
ax.scatter(processed_data[processed_data['Severity']==2].Start_Lng, processed_data[processed_data['Severity']==2].Start_Lat-.1, zorder=1, c='g', s=6)
ax.scatter(processed_data[processed_data['Severity']==3].Start_Lng, processed_data[processed_data['Severity']==3].Start_Lat-.1, zorder=1, c='y', s=8)
ax.scatter(processed_data[processed_data['Severity']==4].Start_Lng, processed_data[processed_data['Severity']==4].Start_Lat-.1, zorder=1, c='r', s=10)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto', interpolation='lanczos')

#### BUILDING MODEL USING SUPPORT VECTOR MACHINE

In [None]:
# Support Vector Machine | First Iteration

# Instantiate an object of class SVC()
clf = SVC(gamma='auto', kernel='rbf', random_state=42)

# Train & Test (limiting rows since SVM takes much time)
clf.fit(X_train[:10000], y_train[:10000])
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Support Vector Machine accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Support Vector Machine | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.1, 0.5, 1],
    'gamma': ['auto', 'scale']
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:5000], y_val[:5000])

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Support Vector Machine | Final Evaluation

# Create a SVM Classifier
clf=SVC(**grid_search.best_params_, kernel='rbf', random_state=42)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# storing the accuracy score
result['Support Vector Machine'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: SVM gives decent accuracy, but its computation time is very high even on a limited subset of the data.
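
The grid search above ranks candidates with `scoring='balanced_accuracy'`, while the final evaluation reports plain accuracy. Balanced accuracy is the mean of the per-class recalls, so the two numbers can differ whenever the classes are not equally represented. The sketch below computes both by hand with NumPy on made-up labels; `balanced_accuracy` is a hypothetical helper written for illustration.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# made-up labels: six accidents of severity 2, two of severity 4
y_true_demo = np.array([2, 2, 2, 2, 2, 2, 4, 4])
y_pred_demo = np.array([2, 2, 2, 2, 2, 2, 2, 4])

plain = float(np.mean(y_true_demo == y_pred_demo))    # 7 of 8 correct
balanced = balanced_accuracy(y_true_demo, y_pred_demo)  # (6/6 + 1/2) / 2
```

Because the notebook upsamples every severity level to the same size and stratifies its splits, the two metrics should be close here; the distinction matters more on the original, imbalanced data.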

#### BUILDING MODEL USING DECISION TREE

In [None]:
# Decision Tree Algorithm | First Iteration

# Instantiate a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)

# Print the accuracy score
print('Decision Tree accuracy_score: {:.3f}.'.format(accuracy_score(y_test, y_pred)))

In [None]:
# Decision Tree Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [1000, 2000, 3000],
    'min_samples_leaf': [500, 1000, 1500]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val, y_val)

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Decision Tree Algorithm | Final Evaluation

# Instantiate a Decision Tree Classifier with Best Parameters
clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# storing the accuracy score
result['Decision Tree'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Decision Tree gives decent accuracy, and its computation time is comparatively low.
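
The cost of a grid search grows with the product of the option counts. As a rough check of how much work the Decision Tree grid above implies, the sketch below counts candidates and fits for that grid with `cv=3`, using only the standard library.

```python
from math import prod

# same options as the Decision Tree grid for state 'CA'
dt_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [1000, 2000, 3000],
    'min_samples_leaf': [500, 1000, 1500],
}
cv_folds = 3

n_candidates = prod(len(values) for values in dt_param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

`GridSearchCV` adds one more fit on top of this count, because `refit=True` is its default and the best candidate is retrained on all rows passed to `fit`.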

#### BUILDING MODEL USING RANDOM FOREST

In [None]:
# Random Forest Algorithm | First Iteration

# Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100, bootstrap=False, min_samples_split=400, min_samples_leaf=100, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Random forest algorithm accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Random Forest Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [12, 14, 16],
    'min_samples_split': [200, 300],
    'min_samples_leaf': [50, 75],
    'bootstrap': [False]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:15000], y_val[:15000])

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Random Forest Algorithm | Final Evaluation

# Create a Random Forest Classifier
clf=RandomForestClassifier(**grid_search.best_params_, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# storing the accuracy score
result['Random Forest'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100, 1))]

**Summary**: Random Forest gives good accuracy, and its computation time is also comparatively low.
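
The Random Forest cells above set `bootstrap=False`, which in scikit-learn means every tree is built from the whole dataset, so the trees differ only through the random feature subsets considered at each split. For comparison, the NumPy sketch below estimates how many distinct rows a tree would see with bootstrap sampling; the row count is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000  # illustrative size, not the size of the accident data

# bootstrap sampling: n_rows positions drawn with replacement
bootstrap_sample = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(bootstrap_sample).size / n_rows

# theoretical expectation: 1 - (1 - 1/n)^n, which approaches 1 - 1/e
expected = 1 - (1 - 1 / n_rows) ** n_rows
```

Roughly 63% of the rows appear in each bootstrap sample, so disabling bootstrapping gives every tree more data at the cost of less diversity between trees.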

#### EVALUATING ADDITIONAL ALGORITHMS' PERFORMANCE

#### BUILDING MODEL USING K-NEAREST NEIGHBOR (KNN)

In [None]:
# K-Nearest Neighbor | First Iteration

# Create a k-NN classifier
clf = KNeighborsClassifier(n_jobs=-1)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# storing the accuracy score
result['K-Nearest Neighbors'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: K-Nearest Neighbor gives poor accuracy, and its computation time is very high even on a limited subset of the data.

#### BUILDING MODEL USING NEURAL NETWORK

In [None]:
# Neural Network | First Iteration

model = Sequential()
model.add(Dense(128, input_dim=np.size(X_train,1), activation='relu'))
model.add(Dense(64, activation='relu'))
# Severity labels run from 1 to 4, so to_categorical() yields 5 columns (column 0 is unused)
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = model.fit(X_train, to_categorical(y_train.to_numpy()), 
                    epochs=5, validation_data=(X_val, to_categorical(y_val.to_numpy())), 
                    validation_steps=30, verbose=0)


loss, train_accuracy = model.evaluate(X_train, to_categorical(y_train.to_numpy()), verbose=0)
print(f"\nFor Training Dataset: Loss: {loss} and Accuracy: {train_accuracy}")

loss, test_accuracy = model.evaluate(X_test, to_categorical(y_test.to_numpy()), verbose=0)
print(f"\nFor Testing Dataset: Loss: {loss} and Accuracy: {test_accuracy}")

# storing the accuracy score
result['Neural Network'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Neural Network gives decent accuracy, and its computation time is comparatively low.
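
The output layer above has 5 units although there are only 4 severity levels. That follows from one-hot encoding labels that start at 1: by default, `to_categorical` sizes its output as the largest label plus one. The sketch below reproduces that rule with a NumPy stand-in (`one_hot` is a hypothetical helper, not the Keras function) and shows the alternative of shifting the labels to start at 0.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """NumPy stand-in for one-hot encoding of non-negative integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1  # largest label plus one
    return np.eye(num_classes)[labels]

severity = np.array([1, 2, 3, 4])

as_is = one_hot(severity)        # shape (4, 5): column 0 is never set
shifted = one_hot(severity - 1)  # shape (4, 4): classes renumbered 0..3
```

Both encodings can be trained on; the shifted variant would pair with a 4-unit output layer and needs 1 added back to each predicted class index.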

In [None]:
# Saving the results in file

df = pd.DataFrame.from_dict(result)
# 'State' stays a regular column so that it is written to the CSV
df.to_csv(f'/kaggle/working/result_{state}.csv', index=False)

___

### Building ML Model for State 'TX'

### Importing Libraries

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

# import for pre-processing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# import for visualization
import matplotlib.pyplot as plt

# import for model building
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# import for Neural Network based model building
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Building State specific model | State 'TX'

result = {}
state='TX'
result['State']=state
processed_data = pd.read_csv(f'/kaggle/working/data_{state}.csv').dropna()
cols = processed_data.select_dtypes(include='object').columns

In [None]:
# Class Balancing | Using Up Sampling

# Separate majority and minority classes
df_s1 = processed_data[processed_data['Severity']==1]
df_s2 = processed_data[processed_data['Severity']==2]
df_s3 = processed_data[processed_data['Severity']==3]
df_s4 = processed_data[processed_data['Severity']==4]

# Size of the largest class; every class is resampled to this size
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4))

# Upsample the minority classes (replace=True only for classes smaller than the largest one)
df_s1 = resample(df_s1, replace=len(df_s1)<count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2)<count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3)<count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4)<count, n_samples=count, random_state=42)
 
# Combine majority class with upsampled minority class
processed_data = pd.concat([df_s1, df_s2, df_s3, df_s4])
 
# Display new class counts
processed_data.groupby(by='Severity')['Severity'].count()

In [None]:
# Set the target for the prediction
target='Severity' 

# set X and y
y = processed_data[target]
X = processed_data.drop(target, axis=1)

# Create the encoder.
encoder = OrdinalEncoder()
X[cols] = encoder.fit_transform(X[cols])

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the data set into training and validation data sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Scaling the features of the Train, Validation and Test Datasets
scaler = StandardScaler()

# Fit the scaler on the Train Dataset only, then scale the Train Dataset
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Scaling Validation Dataset with the statistics learned from the Train Dataset
X_val = scaler.transform(X_val)

# Scaling Test Dataset with the statistics learned from the Train Dataset
X_test = scaler.transform(X_test)

#### VISUALIZING THE DATA ON MAP

In [None]:
BBox = ((processed_data.Start_Lng.min(), processed_data.Start_Lng.max(), processed_data.Start_Lat.min(), processed_data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('/kaggle/working/map/map_pic_tx.png')

In [None]:
fig, ax = plt.subplots(figsize = (30,17))
ax.scatter(processed_data[processed_data['Severity']==1].Start_Lng, processed_data[processed_data['Severity']==1].Start_Lat, zorder=1, c='b', s=4)
ax.scatter(processed_data[processed_data['Severity']==2].Start_Lng, processed_data[processed_data['Severity']==2].Start_Lat, zorder=1, c='g', s=6)
ax.scatter(processed_data[processed_data['Severity']==3].Start_Lng, processed_data[processed_data['Severity']==3].Start_Lat, zorder=1, c='y', s=8)
ax.scatter(processed_data[processed_data['Severity']==4].Start_Lng, processed_data[processed_data['Severity']==4].Start_Lat, zorder=1, c='r', s=10)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto', interpolation='lanczos')

#### BUILDING MODEL USING SUPPORT VECTOR MACHINE

In [None]:
# Support Vector Machine | First Iteration

# Instantiate an object of class SVC()
clf = SVC(gamma='auto', kernel='rbf', random_state=42)

# Train & Test (limiting rows since SVM takes much time)
clf.fit(X_train[:10000], y_train[:10000])
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Support Vector Machine accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Support Vector Machine | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.1, 0.5, 1],
    'gamma': ['auto', 'scale']
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:5000], y_val[:5000])

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Support Vector Machine | Final Evaluation

# Create a SVM Classifier
clf=SVC(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# storing the accuracy score
result['Support Vector Machine'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: SVM gives decent accuracy, but its computation time is very high even on a limited subset of the data.

#### BUILDING MODEL USING DECISION TREE

In [None]:
# Decision Tree Algorithm | First Iteration

# Instantiate a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)

# Print the accuracy score
print('Decision Tree accuracy_score: {:.3f}.'.format(accuracy_score(y_test, y_pred)))

In [None]:
# Decision Tree Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [600, 1000],
    'min_samples_leaf': [300, 500]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val, y_val)

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Decision Tree Algorithm | Final Evaluation

# Instantiate a Decision Tree Classifier with Best Parameters
clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# storing the accuracy score
result['Decision Tree'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Decision Tree gives decent accuracy, and its computation time is comparatively low.
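
The Decision Tree grid above lets the search choose between the `gini` and `entropy` split criteria. Both measure how mixed the class labels in a node are, and both are 0 for a pure node. The sketch below evaluates them on made-up class counts; the function names are hypothetical and entropy is expressed in bits.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

balanced_node = [25, 25, 25, 25]  # four severity levels, equally mixed
pure_node = [100, 0, 0, 0]        # only one severity level present

print(gini(balanced_node), entropy(balanced_node))
print(gini(pure_node), entropy(pure_node))
```

With four equally frequent classes the Gini impurity peaks at 0.75 and the entropy at 2 bits, so a split is preferred when it moves the child nodes away from these maxima.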

#### BUILDING MODEL USING RANDOM FOREST

In [None]:
# Random Forest Algorithm | First Iteration

# Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100, bootstrap=False, min_samples_split=400, min_samples_leaf=100, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Random forest algorithm accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Random Forest Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [14, 16, 18],
    'min_samples_split': [100, 200],
    'min_samples_leaf': [25, 50],
    'bootstrap': [False]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:20000], y_val[:20000])

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Random Forest Algorithm | Final Evaluation

# Create a Random Forest Classifier
clf=RandomForestClassifier(**grid_search.best_params_, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# storing the accuracy score
result['Random Forest'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: Random Forest gives good accuracy, and its computation time is also comparatively low.

#### EVALUATING ADDITIONAL ALGORITHMS' PERFORMANCE

#### BUILDING MODEL USING K-NEAREST NEIGHBOR (KNN)

In [None]:
# K-Nearest Neighbor | First Iteration

# Create a k-NN classifier
clf = KNeighborsClassifier(n_jobs=-1)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# storing the accuracy score
result['K-Nearest Neighbors'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: K-Nearest Neighbor gives poor accuracy, and its computation time is very high even on a limited subset of the data.
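
One reason k-NN is slow here is that it does almost no work during `fit` and defers the distance computations to prediction time, so predicting on the full test set is expensive. The sketch below is a brute-force version written with NumPy on made-up points; `knn_predict` is a hypothetical helper, and scikit-learn may use a tree-based index instead of this explicit loop.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force k-NN with Euclidean distance and majority vote."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for q in np.asarray(X_query, dtype=float):
        # distance from this query to every training row
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        labels, votes = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(votes)])
    return np.array(preds)

# two made-up clusters labelled with severities 2 and 4
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_toy = np.array([2, 2, 2, 4, 4, 4])
toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [4.9, 5.2]], k=3)
```

Because the vote depends on raw distances, features on larger scales dominate it, which is why the standardization step earlier matters for this model in particular.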

#### BUILDING MODEL USING NEURAL NETWORK

In [None]:
# Neural Network | First Iteration

model = Sequential()
model.add(Dense(128, input_dim=np.size(X_train,1), activation='relu'))
model.add(Dense(64, activation='relu'))
# Severity labels run from 1 to 4, so to_categorical() yields 5 columns (column 0 is unused)
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = model.fit(X_train, to_categorical(y_train.to_numpy()), 
                    epochs=5, validation_data=(X_val, to_categorical(y_val.to_numpy())), 
                    validation_steps=30, verbose=0)


loss, train_accuracy = model.evaluate(X_train, to_categorical(y_train.to_numpy()), verbose=0)
print(f"\nFor Training Dataset: Loss: {loss} and Accuracy: {train_accuracy}")

loss, test_accuracy = model.evaluate(X_test, to_categorical(y_test.to_numpy()), verbose=0)
print(f"\nFor Testing Dataset: Loss: {loss} and Accuracy: {test_accuracy}")

# storing the accuracy score
result['Neural Network'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Neural Network gives decent accuracy, and its computation time is comparatively low.

In [None]:
# Saving the results in file

df = pd.DataFrame.from_dict(result)
# 'State' stays a regular column so that it is written to the CSV
df.to_csv(f'/kaggle/working/result_{state}.csv', index=False)
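
The `'Train: ..., Test: ...'` label is assembled by hand for every model in this notebook. One possible way to reduce that repetition is a small formatting helper. The sketch below is only an illustration: `format_scores` and `example_result` are hypothetical names, and the accuracy values are made up.

```python
def format_scores(train_accuracy, test_accuracy):
    """Build the 'Train: x, Test: y' label with accuracies as percentages."""
    train_pct = round(train_accuracy * 100, 1)
    test_pct = round(test_accuracy * 100, 1)
    return f'Train: {train_pct}, Test: {test_pct}'

# made-up accuracies, used only to show the resulting label
example_result = {'State': 'TX'}
example_result['Decision Tree'] = [format_scores(0.91234, 0.9)]
print(example_result)
```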

___

### Building ML Model for State 'FL'

### Importing Libraries

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

# import for pre-processing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# import for visualization
import matplotlib.pyplot as plt

# import for model building
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# import for Neural Network based model building
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Building State specific model | State 'FL'

result = {}
state='FL'
result['State']=state
processed_data = pd.read_csv(f'/kaggle/working/data_{state}.csv').dropna()
cols = processed_data.select_dtypes(include='object').columns

In [None]:
# Class Balancing | Using Up Sampling

# Separate majority and minority classes
df_s1 = processed_data[processed_data['Severity']==1]
df_s2 = processed_data[processed_data['Severity']==2]
df_s3 = processed_data[processed_data['Severity']==3]
df_s4 = processed_data[processed_data['Severity']==4]

# Size of the largest class; every class is resampled to this size
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4))

# Upsample the minority classes (replace=True only for classes smaller than the largest one)
df_s1 = resample(df_s1, replace=len(df_s1)<count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2)<count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3)<count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4)<count, n_samples=count, random_state=42)
 
# Combine majority class with upsampled minority class
processed_data = pd.concat([df_s1, df_s2, df_s3, df_s4])
 
# Display new class counts
processed_data.groupby(by='Severity')['Severity'].count()

In [None]:
# Set the target for the prediction
target='Severity' 

# set X and y
y = processed_data[target]
X = processed_data.drop(target, axis=1)

# Create the encoder.
encoder = OrdinalEncoder()
X[cols] = encoder.fit_transform(X[cols])

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the data set into training and validation data sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Scaling the features of the Train, Validation and Test Datasets
scaler = StandardScaler()

# Fit the scaler on the Train Dataset only, then scale the Train Dataset
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Scaling Validation Dataset with the statistics learned from the Train Dataset
X_val = scaler.transform(X_val)

# Scaling Test Dataset with the statistics learned from the Train Dataset
X_test = scaler.transform(X_test)
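
The two consecutive `train_test_split` calls above both use `test_size=0.3`, but the second one applies to the rows left over after the first. The short calculation below works out the resulting share of the balanced data in each split.

```python
test_size = 0.3  # first split: hold out the test set
val_size = 0.3   # second split: taken from the remaining rows

test_share = test_size
val_share = (1 - test_size) * val_size
train_share = (1 - test_size) * (1 - val_size)

print(f'train {train_share:.0%}, validation {val_share:.0%}, test {test_share:.0%}')
```

So slightly less than half of the balanced rows are used for training, with about a fifth for validation and the rest for testing.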

#### VISUALIZING THE DATA ON MAP

In [None]:
BBox = ((processed_data.Start_Lng.min(), processed_data.Start_Lng.max(), processed_data.Start_Lat.min(), processed_data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('/kaggle/working/map/map_pic_fl.png')

In [None]:
fig, ax = plt.subplots(figsize = (30,30))
ax.scatter(processed_data[processed_data['Severity']==1].Start_Lng, processed_data[processed_data['Severity']==1].Start_Lat, zorder=1, c='b', s=4)
ax.scatter(processed_data[processed_data['Severity']==2].Start_Lng, processed_data[processed_data['Severity']==2].Start_Lat, zorder=1, c='g', s=6)
ax.scatter(processed_data[processed_data['Severity']==3].Start_Lng, processed_data[processed_data['Severity']==3].Start_Lat, zorder=1, c='y', s=8)
ax.scatter(processed_data[processed_data['Severity']==4].Start_Lng, processed_data[processed_data['Severity']==4].Start_Lat, zorder=1, c='r', s=10)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto', interpolation='lanczos')

#### BUILDING MODEL USING SUPPORT VECTOR MACHINE

In [None]:
# Support Vector Machine | First Iteration

# Instantiate an object of class SVC()
clf = SVC(gamma='auto', kernel='rbf', random_state=42)

# Train & Test (limiting rows since SVM takes much time)
clf.fit(X_train[:10000], y_train[:10000])
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Support Vector Machine accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Support Vector Machine | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.1, 0.5, 1],
    'gamma': ['auto', 'scale']
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:5000], y_val[:5000])

# printing the best cross-validated balanced accuracy and the matching hyperparameters
print('Best cross-validated balanced accuracy:',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Support Vector Machine | Final Evaluation

# Create a SVM Classifier
clf=SVC(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# storing the accuracy score
result['Support Vector Machine'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: SVM gives decent accuracy, but its computation time is very high even on a limited subset of the data.
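
The SVM grid above tries `gamma='auto'` and `gamma='scale'`. In scikit-learn, `'auto'` means `1 / n_features` and `'scale'` means `1 / (n_features * X.var())`. The NumPy sketch below evaluates both formulas on random stand-in data that has been standardized; the matrix is synthetic and its shape is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))  # synthetic stand-in for a feature matrix
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardize each column

n_features = X_demo.shape[1]
gamma_auto = 1.0 / n_features
gamma_scale = 1.0 / (n_features * X_demo.var())

print(gamma_auto, gamma_scale)
```

On standardized features the overall variance is 1, so the two settings nearly coincide. They differ only to the extent that the rows passed to `fit` deviate from unit variance, which limits how much this grid dimension can change the result.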

#### BUILDING MODEL USING DECISION TREE

In [None]:
# Decision Tree Algorithm | First Iteration

# Instantiate a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)

# Print the accuracy score
print('Decision Tree accuracy_score: {:.3f}.'.format(accuracy_score(y_test, y_pred)))

In [None]:
# Decision Tree Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [2000, 4000],
    'min_samples_leaf': [1000, 2000]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val, y_val)

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Decision Tree Algorithm | Final Evaluation

# Instantiate a Decision Tree Classifier with Best Parameters
clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Decision Tree'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Decision Tree gives decent accuracy, and its computation time is comparatively low.

#### BUILDING MODEL USING RANDOM FOREST

In [None]:
# Random Forest Algorithm | First Iteration

# Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100, min_samples_split=400, min_samples_leaf=100, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Random forest algorithm accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Random Forest Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [14, 16, 18],
    'min_samples_split': [100, 200],
    'min_samples_leaf': [25, 50],
    'bootstrap': [False]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:20000], y_val[:20000])

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Random Forest Algorithm | Final Evaluation

# Create a Random Forest Classifier
clf=RandomForestClassifier(**grid_search.best_params_, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Random Forest'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: Random Forest gives good accuracy, and its computation time is comparatively low.

#### EVALUATING ADDITIONAL ALGORITHMS' PERFORMANCE

#### BUILDING MODEL USING K-NEAREST NEIGHBOR (KNN)

In [None]:
# K-Nearest Neighbor | First Iteration

# Create a k-NN classifier
clf = KNeighborsClassifier(n_jobs=-1)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Storing the accuracy score
result['K-Nearest Neighbors'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: K-Nearest Neighbors gives poor accuracy, and its computation time is very high, even on a limited subset of the data.

#### BUILDING MODEL USING NEURAL NETWORK

In [None]:
# Neural Network | First Iteration

model = Sequential()
model.add(Dense(128, input_dim=np.size(X_train,1), activation='relu'))
# only the first layer needs the input size; later layers infer it
model.add(Dense(64, activation='relu'))
# to_categorical() creates columns 0..4 for Severity labels 1..4, hence 5 output units
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = model.fit(X_train, to_categorical(y_train.to_numpy()),
                    epochs=5, validation_data=(X_val, to_categorical(y_val.to_numpy())),
                    validation_steps=30, verbose=0)


loss, train_accuracy = model.evaluate(X_train, to_categorical(y_train.to_numpy()), verbose=0)
print(f"\nFor Training Dataset: Loss: {loss} and Accuracy: {train_accuracy}")

loss, test_accuracy = model.evaluate(X_test, to_categorical(y_test.to_numpy()), verbose=0)
print(f"\nFor Testing Dataset: Loss: {loss} and Accuracy: {test_accuracy}")

# Storing the accuracy score
result['Neural Network'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Neural Network gives decent accuracy, and its computation time is comparatively low.
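
The output layer has 5 units even though Severity only takes the values 1 to 4. `to_categorical` sizes its output as the largest label plus one, so labels 1..4 produce columns 0..4 and column 0 is never used. A plain-Python sketch of that encoding follows; `one_hot` is a hypothetical stand-in for `to_categorical`, written here to avoid the TensorFlow dependency, and the label list is made up.

```python
def one_hot(labels, num_classes=None):
    """Mimic to_categorical: one column per integer in 0..max(labels)."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if col == label else 0.0 for col in range(num_classes)]
            for label in labels]

severity = [1, 2, 3, 4, 2]  # illustrative labels, not taken from the dataset

encoded = one_hot(severity)
print(len(encoded[0]))   # 5 columns; column 0 is always zero

# Shifting the labels to 0..3 would allow a 4-unit output layer instead
shifted = one_hot([s - 1 for s in severity])
print(len(shifted[0]))   # 4 columns, one per severity level
```

Either layout can be trained; the 5-unit version simply carries one output that no sample ever targets.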

In [None]:
# Saving the results in file

df = pd.DataFrame.from_dict(result)
# 'State' is kept as a regular column so that it is written to the CSV
df.to_csv(f'/kaggle/working/result_{state}.csv', index=False)

___

### Building ML Model for State 'SC'

### Importing Libraries

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

# import for pre-processing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# import for visualization
import matplotlib.pyplot as plt

# import for model building
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# import for Neural Network based model building
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Building State specific model | State 'SC'

result = {}
state='SC'
result['State']=state
processed_data = pd.read_csv(f'/kaggle/working/data_{state}.csv').dropna()
cols = processed_data.select_dtypes(include='object').columns

In [None]:
# Class Balancing | Using Up Sampling

# Separate majority and minority classes
df_s1 = processed_data[processed_data['Severity']==1]
df_s2 = processed_data[processed_data['Severity']==2]
df_s3 = processed_data[processed_data['Severity']==3]
df_s4 = processed_data[processed_data['Severity']==4]

# Size of the largest class
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4))

# Upsample the minority classes (with replacement) to the size of the largest class
df_s1 = resample(df_s1, replace=len(df_s1)<count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2)<count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3)<count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4)<count, n_samples=count, random_state=42)
 
# Combine majority class with upsampled minority class
processed_data = pd.concat([df_s1, df_s2, df_s3, df_s4])
 
# Display new class counts
processed_data.groupby(by='Severity')['Severity'].count()
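
The balancing step above can be sketched without pandas: every class is brought up to the size of the largest class, drawing with replacement only when a class is smaller than that target. The helper `upsample_to_max` and the toy rows are illustrative assumptions, not part of the notebook. Unlike `resample`, this sketch leaves the largest class in its original order.

```python
import random
from collections import Counter

def upsample_to_max(rows, label_of, seed=42):
    """Resample every class up to the size of the largest class."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        if len(group) < target:
            # minority class: draw with replacement until it reaches the target size
            balanced.extend(rng.choices(group, k=target))
        else:
            # largest class: already at the target size, keep it unchanged
            balanced.extend(group)
    return balanced

# Toy data: 6 rows of severity 2, 3 rows of severity 3, 1 row of severity 4
toy = [{'Severity': 2}] * 6 + [{'Severity': 3}] * 3 + [{'Severity': 4}] * 1
balanced = upsample_to_max(toy, lambda row: row['Severity'])
counts = Counter(row['Severity'] for row in balanced)
print(counts)  # every class now has 6 rows
```

Because minority rows are duplicated rather than newly observed, the balanced set contains repeated records, which is worth keeping in mind when reading the train and test accuracies.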

In [None]:
# Set the target for the prediction
target='Severity' 

# set X and y
y = processed_data[target]
X = processed_data.drop(target, axis=1)

# Create the encoder.
encoder = OrdinalEncoder()
X[cols] = encoder.fit_transform(X[cols])

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the data set into training and validation data sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Scaling the features of the Train, Validation and Test Datasets
scaler = StandardScaler()

# Fit the scaler on the Train Dataset only, then scale it
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Scaling Validation Dataset with the statistics learned from the Train Dataset
X_val = scaler.transform(X_val)

# Scaling Test Dataset with the statistics learned from the Train Dataset
X_test = scaler.transform(X_test)
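
A scaler is normally fitted on the training split only, and the same mean and standard deviation are then reused for the validation and test splits, so that all three share one feature scale. A library-free sketch of that idea with made-up numbers; `fit_scaler` and `transform` are hypothetical helpers standing in for `StandardScaler.fit` and `StandardScaler.transform`.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardise a feature with previously learned statistics."""
    return [(x - mean) / std for x in column]

train = [10.0, 20.0, 30.0]   # made-up feature values
test = [20.0, 40.0]

mean, std = fit_scaler(train)            # statistics come from the training split only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

With these numbers the raw value 20.0 maps to 0.0 in both splits. Fitting a separate scaler on the test split would map 20.0 to -1.0 there, so the same raw value would mean different things during training and testing.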

#### VISUALIZING THE DATA ON MAP

In [None]:
BBox = ((processed_data.Start_Lng.min(), processed_data.Start_Lng.max(), processed_data.Start_Lat.min(), processed_data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('/kaggle/working/map/map_pic_sc.png')

In [None]:
fig, ax = plt.subplots(figsize = (34,26))
ax.scatter(processed_data[processed_data['Severity']==1].Start_Lng, processed_data[processed_data['Severity']==1].Start_Lat-0.01, zorder=1, c='b', s=4)
ax.scatter(processed_data[processed_data['Severity']==2].Start_Lng, processed_data[processed_data['Severity']==2].Start_Lat-0.01, zorder=1, c='g', s=6)
ax.scatter(processed_data[processed_data['Severity']==3].Start_Lng, processed_data[processed_data['Severity']==3].Start_Lat-0.01, zorder=1, c='y', s=8)
ax.scatter(processed_data[processed_data['Severity']==4].Start_Lng, processed_data[processed_data['Severity']==4].Start_Lat-0.01, zorder=1, c='r', s=10)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto', interpolation='lanczos')

#### BUILDING MODEL USING SUPPORT VECTOR MACHINE

In [None]:
# Support Vector Machine | First Iteration

# Instantiate an object of class SVC()
clf = SVC(gamma='auto', kernel='rbf', random_state=42)

# Train & Test (limiting rows since SVM takes much time)
clf.fit(X_train[:10000], y_train[:10000])
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Support Vector Machine accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Support Vector Machine | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.1, 0.5, 1],
    'gamma': ['auto', 'scale']
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:5000], y_val[:5000])

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Support Vector Machine | Final Evaluation

# Create a SVM Classifier
clf=SVC(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Storing the accuracy score
result['Support Vector Machine'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: SVM gives decent accuracy, but its computation time is very high, even on a limited subset of the data.
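
The grid search above is scored with `balanced_accuracy`, while the final evaluation reports `accuracy_score`. Balanced accuracy is the mean of the per-class recalls, so the two numbers can differ when the classes are unevenly represented. A plain-Python sketch with made-up labels; `accuracy` and `balanced_accuracy` are illustrative re-implementations, not the scikit-learn functions.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up example: eight samples of severity 2, two of severity 4
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 4, 4]
y_pred = [2, 2, 2, 2, 2, 2, 2, 2, 2, 4]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Since the classes were upsampled to equal size before the split, the two metrics should be close in this notebook, but the score printed by the grid search and the accuracy in the final report are still not the same quantity.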

#### BUILDING MODEL USING DECISION TREE

In [None]:
# Decision Tree Algorithm | First Iteration

# Instantiate a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)

# Print the accuracy score
print('Decision Tree accuracy_score: {:.3f}.'.format(accuracy_score(y_test, y_pred)))

In [None]:
# Decision Tree Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [500, 1000],
    'min_samples_leaf': [250, 500]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val, y_val)

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Decision Tree Algorithm | Final Evaluation

# Instantiate a Decision Tree Classifier with Best Parameters
clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Decision Tree'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Decision Tree gives decent accuracy, and its computation time is comparatively low.

#### BUILDING MODEL USING RANDOM FOREST

In [None]:
# Random Forest Algorithm | First Iteration

# Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100, min_samples_split=400, min_samples_leaf=100, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Random forest algorithm accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Random Forest Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [12, 14, 16],
    'min_samples_split': [100, 200],
    'min_samples_leaf': [25, 50],
    'bootstrap': [False]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:20000], y_val[:20000])

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Random Forest Algorithm | Final Evaluation

# Create a Random Forest Classifier
clf=RandomForestClassifier(**grid_search.best_params_, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Random Forest'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: Random Forest gives good accuracy, and its computation time is comparatively low.

#### EVALUATING ADDITIONAL ALGORITHMS' PERFORMANCE

#### BUILDING MODEL USING K-NEAREST NEIGHBOR (KNN)

In [None]:
# K-Nearest Neighbor | First Iteration

# Create a k-NN classifier
clf = KNeighborsClassifier(n_jobs=-1)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Storing the accuracy score
result['K-Nearest Neighbors'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: K-Nearest Neighbors gives poor accuracy, and its computation time is very high, even on a limited subset of the data.

#### BUILDING MODEL USING NEURAL NETWORK

In [None]:
# Neural Network | First Iteration

model = Sequential()
model.add(Dense(128, input_dim=np.size(X_train,1), activation='relu'))
# only the first layer needs the input size; later layers infer it
model.add(Dense(64, activation='relu'))
# to_categorical() creates columns 0..4 for Severity labels 1..4, hence 5 output units
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = model.fit(X_train, to_categorical(y_train.to_numpy()),
                    epochs=5, validation_data=(X_val, to_categorical(y_val.to_numpy())),
                    validation_steps=30, verbose=0)


loss, train_accuracy = model.evaluate(X_train, to_categorical(y_train.to_numpy()), verbose=0)
print(f"\nFor Training Dataset: Loss: {loss} and Accuracy: {train_accuracy}")

loss, test_accuracy = model.evaluate(X_test, to_categorical(y_test.to_numpy()), verbose=0)
print(f"\nFor Testing Dataset: Loss: {loss} and Accuracy: {test_accuracy}")

# Storing the accuracy score
result['Neural Network'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Neural Network gives decent accuracy, and its computation time is comparatively low.

In [None]:
# Saving the results in file

df = pd.DataFrame.from_dict(result)
# 'State' is kept as a regular column so that it is written to the CSV
df.to_csv(f'/kaggle/working/result_{state}.csv', index=False)

___

### Building ML Model for State 'NC'

### Importing Libraries

In [None]:
# Deleting all data
%reset -f

# Reloading necessary libraries
# import numpy and pandas
import pandas as pd
import numpy as np

# import for pre-processing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# import for visualization
import matplotlib.pyplot as plt

# import for model building
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# import for Neural Network based model building
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Building State specific model | State 'NC'

result = {}
state='NC'
result['State']=state
processed_data = pd.read_csv(f'/kaggle/working/data_{state}.csv').dropna()
cols = processed_data.select_dtypes(include='object').columns

In [None]:
# Class Balancing | Using Up Sampling

# Separate majority and minority classes
df_s1 = processed_data[processed_data['Severity']==1]
df_s2 = processed_data[processed_data['Severity']==2]
df_s3 = processed_data[processed_data['Severity']==3]
df_s4 = processed_data[processed_data['Severity']==4]

# Size of the largest class
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4))

# Upsample the minority classes (with replacement) to the size of the largest class
df_s1 = resample(df_s1, replace=len(df_s1)<count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2)<count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3)<count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4)<count, n_samples=count, random_state=42)
 
# Combine majority class with upsampled minority class
processed_data = pd.concat([df_s1, df_s2, df_s3, df_s4])
 
# Display new class counts
processed_data.groupby(by='Severity')['Severity'].count()

In [None]:
# Set the target for the prediction
target='Severity' 

# set X and y
y = processed_data[target]
X = processed_data.drop(target, axis=1)

# Create the encoder.
encoder = OrdinalEncoder()
X[cols] = encoder.fit_transform(X[cols])

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the data set into training and validation data sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Scaling the features of the Train, Validation and Test Datasets
scaler = StandardScaler()

# Fit the scaler on the Train Dataset only, then scale it
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Scaling Validation Dataset with the statistics learned from the Train Dataset
X_val = scaler.transform(X_val)

# Scaling Test Dataset with the statistics learned from the Train Dataset
X_test = scaler.transform(X_test)

#### VISUALIZING THE DATA ON MAP

In [None]:
BBox = ((processed_data.Start_Lng.min(), processed_data.Start_Lng.max(), processed_data.Start_Lat.min(), processed_data.Start_Lat.max()))
BBox

In [None]:
map_pic = plt.imread('/kaggle/working/map/map_pic_nc.png')

In [None]:
fig, ax = plt.subplots(figsize = (31,12))
ax.scatter(processed_data[processed_data['Severity']==1].Start_Lng, processed_data[processed_data['Severity']==1].Start_Lat, zorder=1, c='b', s=4)
ax.scatter(processed_data[processed_data['Severity']==2].Start_Lng, processed_data[processed_data['Severity']==2].Start_Lat, zorder=1, c='g', s=6)
ax.scatter(processed_data[processed_data['Severity']==3].Start_Lng, processed_data[processed_data['Severity']==3].Start_Lat, zorder=1, c='y', s=8)
ax.scatter(processed_data[processed_data['Severity']==4].Start_Lng, processed_data[processed_data['Severity']==4].Start_Lat, zorder=1, c='r', s=10)

ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(map_pic, zorder=0, extent = BBox, aspect= 'auto', interpolation='none')
ax.imshow(map_pic, zorder=2, alpha= 0.5, extent = BBox, aspect= 'auto', interpolation='lanczos')

#### BUILDING MODEL USING SUPPORT VECTOR MACHINE

In [None]:
# Support Vector Machine | First Iteration

# Instantiate an object of class SVC()
clf = SVC(gamma='auto', kernel='rbf', random_state=42)

# Train & Test (limiting rows since SVM takes much time)
clf.fit(X_train[:10000], y_train[:10000])
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Support Vector Machine accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Support Vector Machine | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.1, 0.5, 1],
    'gamma': ['auto', 'scale']
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:5000], y_val[:5000])

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Support Vector Machine | Final Evaluation

# Create a SVM Classifier
clf=SVC(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Storing the accuracy score
result['Support Vector Machine'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: SVM gives decent accuracy, but its computation time is very high, even on a limited subset of the data.

#### BUILDING MODEL USING DECISION TREE

In [None]:
# Decision Tree Algorithm | First Iteration

# Instantiate a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)

# Print the accuracy score
print('Decision Tree accuracy_score: {:.3f}.'.format(accuracy_score(y_test, y_pred)))

In [None]:
# Decision Tree Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [12, 14, 16],
    'min_samples_split': [1000, 2000],
    'min_samples_leaf': [500, 1000]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=3, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val, y_val)

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Decision Tree Algorithm | Final Evaluation

# Instantiate a Decision Tree Classifier with Best Parameters
clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Decision Tree'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Decision Tree gives decent accuracy, and its computation time is comparatively low.

#### BUILDING MODEL USING RANDOM FOREST

In [None]:
# Random Forest Algorithm | First Iteration

# Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100, min_samples_split=400, min_samples_leaf=100, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Random forest algorithm accuracy_score: {:.3f}.".format(accuracy_score(y_test, y_pred)))

In [None]:
# Random Forest Algorithm | Optimization

# Create the parameter grid based on the results of random search 
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [12, 14, 16],
    'min_samples_split': [100, 200],
    'min_samples_leaf': [25, 50],
    'bootstrap': [False]
}

# Instantiate the grid search model
grid_search = GridSearchCV(cv=5, estimator = clf, param_grid = param_grid, scoring='balanced_accuracy', n_jobs = -1,verbose = 5)

# Fit the grid search to the Validation Dataset
grid_search.fit(X_val[:20000], y_val[:20000])

# Printing the best cross-validated balanced accuracy and hyperparameters
print('We can get balanced accuracy of',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Random Forest Algorithm | Final Evaluation

# Create a Random Forest Classifier
clf=RandomForestClassifier(**grid_search.best_params_, n_jobs=-1, random_state=42)

# Train & Test
clf.fit(X_train, y_train)
y_train_pred= clf.predict(X_train)
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Highlighting the significance of each of the factors in the model
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
print("\nImportant features:\n", feature_imp.head(10))

# Storing the accuracy score
result['Random Forest'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: Random Forest gives good accuracy, and its computation time is comparatively low.

#### EVALUATING ADDITIONAL ALGORITHMS' PERFORMANCE

#### BUILDING MODEL USING K-NEAREST NEIGHBOR (KNN)

In [None]:
# K-Nearest Neighbor | First Iteration

# Create a k-NN classifier
clf = KNeighborsClassifier(n_jobs=-1)

# Train & Test
clf.fit(X_train[:20000], y_train[:20000])
y_train_pred= clf.predict(X_train[:20000])
y_test_pred= clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
# Detailed report of classification done by model

train_accuracy, test_accuracy = accuracy_score(y_train[:20000], y_train_pred), accuracy_score(y_test, y_test_pred)
print(classification_report(y_test, y_test_pred))
print(f'Accuracy for the train dataset {train_accuracy:.1%}')
print(f'Accuracy for the test dataset {test_accuracy:.1%}')

# Storing the accuracy score
result['K-Nearest Neighbors'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: K-Nearest Neighbors gives poor accuracy, and its computation time is very high, even on a limited subset of the data.

#### BUILDING MODEL USING NEURAL NETWORK

In [None]:
# Neural Network | First Iteration

model = Sequential()
model.add(Dense(128, input_dim=np.size(X_train,1), activation='relu'))
# only the first layer needs the input size; later layers infer it
model.add(Dense(64, activation='relu'))
# to_categorical() creates columns 0..4 for Severity labels 1..4, hence 5 output units
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = model.fit(X_train, to_categorical(y_train.to_numpy()),
                    epochs=5, validation_data=(X_val, to_categorical(y_val.to_numpy())),
                    validation_steps=30, verbose=0)


loss, train_accuracy = model.evaluate(X_train, to_categorical(y_train.to_numpy()), verbose=0)
print(f"\nFor Training Dataset: Loss: {loss} and Accuracy: {train_accuracy}")

loss, test_accuracy = model.evaluate(X_test, to_categorical(y_test.to_numpy()), verbose=0)
print(f"\nFor Testing Dataset: Loss: {loss} and Accuracy: {test_accuracy}")

# Storing the accuracy score
result['Neural Network'] = ['Train: '+str(round(train_accuracy*100, 1))+', Test: '+str(round(test_accuracy*100,1))]

**Summary**: The Neural Network gives decent accuracy, and its computation time is comparatively low.

In [None]:
# Saving the results in file

df = pd.DataFrame.from_dict(result)
# 'State' is kept as a regular column so that it is written to the CSV
df.to_csv(f'/kaggle/working/result_{state}.csv', index=False)

___

## Combined Results of Models on Datasets of States

In [None]:
df = pd.concat([
    pd.read_csv('/kaggle/working/result_CA.csv'),pd.read_csv('/kaggle/working/result_TX.csv'),
    pd.read_csv('/kaggle/working/result_FL.csv'),pd.read_csv('/kaggle/working/result_SC.csv'),
    pd.read_csv('/kaggle/working/result_NC.csv')]).set_index(['State'])
df
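
Each cell of the combined table holds a string of the form `'Train: <x>, Test: <y>'`, which is easy to read but cannot be sorted or plotted directly. A small sketch of splitting such a string into numbers; the helper `parse_scores` and the sample value are made up for illustration and are not results from this notebook.

```python
import re

def parse_scores(cell):
    """Split 'Train: <x>, Test: <y>' into a (train, test) tuple of floats."""
    match = re.fullmatch(r'Train: ([\d.]+), Test: ([\d.]+)', cell.strip())
    if match is None:
        raise ValueError(f'unexpected cell format: {cell!r}')
    return float(match.group(1)), float(match.group(2))

# Made-up sample value in the same format as the stored results
train_score, test_score = parse_scores('Train: 85.3, Test: 84.1')
print(train_score, test_score)             # 85.3 84.1
print(round(train_score - test_score, 1))  # 1.2, the train/test gap in percentage points
```

With pandas, the same helper could be applied to a whole column, for example through `Series.map`, to obtain numeric train and test scores per state.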

___