# Coursera Capstone Project

# Introduction

##### Did you know that on average 3,700 people lose their lives everyday due to road accidents? That is 1.35 million people a year! Not only that, another 50 million people suffer long term disabilities! 
##### Countries all over the world work hard to reduce the number of fatalities and injuries that are caused by road accidents. From bicycle dedicated lanes, speed control, black-point systems to awareness campaigns, all these did help in reducing the number of accidents. Yet, "Road crashes are the leading cause of death in the U.S. for people aged 1-54" as reported by The Association for Safe International Road Travel. In this project, I will attempt to create a model that can predict the severity of an accident given the current weather and road conditions. This model can help warn drivers of the risk they are going to be taking when they decide to drive. In turn, they might change their route or even postpone the commute if its not urgent.

# Business Understanding

##### Less accidents means less damages on health & property. Tools that can reduce the number of accidents can be of a great value to governments who would want a smooth traffic flow across their cities and protect the wellbeing of their citizens. Also, insurance companies who would be interested in minimizing their losses of repairing properties & covering medical bills. In this project, we will use the data collected by Seattle SPOT Traffic Management Division which contains records of all accidents from 2004 to present to help us predict the severity of accidents that could happen as a result of the current weather and road conditions. The record contains 37 attributes which we will analyze and use to be able to predict the severity of potential accidents with the highest level of accuracy.

# Data Understanding

##### The dataset we selected has 194,673 rows and 37 different independent attributes. The target variable Y also known as Dependent Variable will be SEVERITY CODE. We will use the relevant attributes from the 37 attributes as our independent variable X. The data set is quite large and we need to get rid of irrelevant columns, rows with missing data which we can't make an assumption for and also we need to make sure our data set is balanced. 

##### The dependent variable, “SEVERITYCODE”, contains numbers that correspond to different levels of severity caused by an accident. The code that corresponds to the severity of the collision:

  * 3—fatality
  * 2b—serious injury
  *  2—injury
  *  1—prop damage
  *  0—unknown
  
##### At this point we would like to note that the dataset provided by Seattle SPOT Traffic Management Division has data for accidents with “SEVERITYCODE” 1 and 2 only.

##### The independent variables are many and we list below the most important ones.

  *  PERSONCOUNT: The total number of people involved in the collision helps identify severity involved
  *  PEDCOUNT: The number of pedestrians involved in the collision helps identify severity involved
  *  PEDCYLCOUNT: The number of bicycles involved in the collision helps identify severity involved
  *  VEHCOUNT: The number of vehicles involved in the collision identify severity involved
  *  INCDATE  : The date of the incident.  
  *  ADDRTYPE: Collision address type: Alley, Block, Intersection
  *  COLLISIONTYPE: Collision Type
  *  JUNCTIONTYPE: Category of junction at which collision took place helps identify where most collisions occur
  *  INATTENTIONIND: Whether or not collision was due to inattention.
  *  UNDERINFL: Whether or not a driver involved was under the influence of drugs or alcohol. 
  *  LOCATION: Description of the general location of the collision
  *  WEATHER: A description of the weather conditions during the time of the collision
  *  ROADCOND: The condition of the road during the collision
  *  LIGHTCOND: The light conditions during the collision
  *  SPEEDING: Whether speeding was a factor in the collision (Y/N)
  
##### Attributes like "INATTENTIONIND", "UNDERINFL" & "SPEEDING" cause noise in our data as we are interested in the effect of the weather and road condition on the severity of the accident. We know that these attributes do contribute to the severity of the accident but for the purpose of analyzing the weather and road condition only we will drop incidents that had been affected by "INATTENTIONIND", "UNDERINFL" & "SPEEDING".

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import time
from scipy import stats
import matplotlib.pylab as plt

In [1]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Loading the Data

In [6]:
df = pd.read_csv("Data-Collisions-Edited.csv")

  interactivity=interactivity, compiler=compiler, result=result)


In [7]:
#Filter out attributes that will be used
df=df[['SEVERITYCODE','PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT','VEHCOUNT','INCDATES','ADDRTYPE','COLLISIONTYPE','JUNCTIONTYPE','INATTENTIONIND','UNDERINFL','WEATHER','ROADCOND','LIGHTCOND','SPEEDING']].copy()
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,2,0,0,2,2013/03/27,Intersection,Angles,At Intersection (intersection related),,N,Overcast,Wet,Daylight,
1,1,2,0,0,2,2006/12/20,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,
2,1,4,0,0,3,2004/11/18,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,
3,1,3,0,0,3,2013/03/29,Block,Other,Mid-Block (not related to intersection),,N,Clear,Dry,Daylight,
4,2,2,0,0,2,2004/01/28,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,


In [8]:
df.shape

(194673, 15)

# Cleaning the Data

#### UNDERINFL seem to be a mix of Y & N and Boolean Values 0 & 1. Here We change all to boolean

In [9]:
df["UNDERINFL"].replace("N",0,inplace=True)
df["UNDERINFL"].replace("Y",1,inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,2,0,0,2,2013/03/27,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,
1,1,2,0,0,2,2006/12/20,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,
2,1,4,0,0,3,2004/11/18,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,
3,1,3,0,0,3,2013/03/29,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,
4,2,2,0,0,2,2004/01/28,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,


### Some data is set to "unkown" which we will change to Nan (Python's default missing value marker)

In [10]:
df.replace("", np.nan, inplace = True)
df["JUNCTIONTYPE"].replace("Unkown", np.nan,inplace=True)
df["WEATHER"].replace("Unkown", np.nan,inplace=True)
df["ROADCOND"].replace("Unkown", np.nan,inplace=True)
df["LIGHTCOND"].replace("Unkown", np.nan,inplace=True)
df.head(5)

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,2,0,0,2,2013/03/27,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,
1,1,2,0,0,2,2006/12/20,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,
2,1,4,0,0,3,2004/11/18,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,
3,1,3,0,0,3,2013/03/29,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,
4,2,2,0,0,2,2004/01/28,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,


### We will need month and year in separate columns to perform some analysis on the data later

In [11]:
df["INCMONTH"]=""
df["INCYEAR"]=""
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INCMONTH,INCYEAR
0,2,2,0,0,2,2013/03/27,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,,,
1,1,2,0,0,2,2006/12/20,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,,,
2,1,4,0,0,3,2004/11/18,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,,,
3,1,3,0,0,3,2013/03/29,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,,,
4,2,2,0,0,2,2004/01/28,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,,,


In [12]:
i=0
ii=0
while i<194673:
    df["INCMONTH"][i]=datetime.strptime(df["INCDATES"][i], '%Y/%m/%d').month
    df["INCYEAR"][i]=datetime.strptime(df["INCDATES"][i], '%Y/%m/%d').year
    i=i+1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [13]:
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INCMONTH,INCYEAR
0,2,2,0,0,2,2013/03/27,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,,3,2013
1,1,2,0,0,2,2006/12/20,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,,12,2006
2,1,4,0,0,3,2004/11/18,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,,11,2004
3,1,3,0,0,3,2013/03/29,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,,3,2013
4,2,2,0,0,2,2004/01/28,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,,1,2004


### Change "INCDATES" to indicate either weekday or weekend

In [14]:
#This will first change the date to indicate which day of the week the incident happened (Monday is 0 till Sunday which is 6)
#Then we change the day of the week to either weekday(1) or weekend (0)
ii=0

while ii<194673:
    df["INCDATES"][ii]=datetime.strptime(df["INCDATES"][ii], '%Y/%m/%d').weekday()
    ii=ii+1



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [15]:
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INCMONTH,INCYEAR
0,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,,3,2013
1,1,2,0,0,2,2,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,,12,2006
2,1,4,0,0,3,3,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,,11,2004
3,1,3,0,0,3,4,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,,3,2013
4,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,,1,2004


### Dealing with missing Data

In [16]:
#Creating a Dataframe that will help us know how many missign values we have
missing_data=df.isnull()
missing_data.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INCMONTH,INCYEAR
0,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False


In [17]:
#Get the count of missing values
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

SEVERITYCODE
False    194673
Name: SEVERITYCODE, dtype: int64

PERSONCOUNT
False    194673
Name: PERSONCOUNT, dtype: int64

PEDCOUNT
False    194673
Name: PEDCOUNT, dtype: int64

PEDCYLCOUNT
False    194673
Name: PEDCYLCOUNT, dtype: int64

VEHCOUNT
False    194673
Name: VEHCOUNT, dtype: int64

INCDATES
False    194673
Name: INCDATES, dtype: int64

ADDRTYPE
False    192747
True       1926
Name: ADDRTYPE, dtype: int64

COLLISIONTYPE
False    189769
True       4904
Name: COLLISIONTYPE, dtype: int64

JUNCTIONTYPE
False    188344
True       6329
Name: JUNCTIONTYPE, dtype: int64

INATTENTIONIND
True     164868
False     29805
Name: INATTENTIONIND, dtype: int64

UNDERINFL
False    189789
True       4884
Name: UNDERINFL, dtype: int64

WEATHER
False    189592
True       5081
Name: WEATHER, dtype: int64

ROADCOND
False    189661
True       5012
Name: ROADCOND, dtype: int64

LIGHTCOND
False    189503
True       5170
Name: LIGHTCOND, dtype: int64

SPEEDING
True     185340
False      9333
Name: SPE

In [18]:
#JUNCTIONTYPE,WEATHER,ROADCOND & LIGHTCOND attributes have some incidents indicated as Unkown which we will need to drop.
#Here we will also drop the missing data 
df["JUNCTIONTYPE"].replace("Unknown", np.nan, inplace=True)
df.dropna(subset=["JUNCTIONTYPE"], axis=0, inplace=True)
df["WEATHER"].replace("Unknown", np.nan, inplace=True)
df.dropna(subset=["WEATHER"], axis=0, inplace=True)
df["ROADCOND"].replace("Unknown", np.nan, inplace=True)
df.dropna(subset=["ROADCOND"], axis=0, inplace=True)
df["LIGHTCOND"].replace("Unknown", np.nan, inplace=True)
df.dropna(subset=["LIGHTCOND"], axis=0, inplace=True)

In [19]:
#Here we are dropping incidents caused by inattention as this will cause noise to the relation of accidents with weather
indexNames_1 = df[ df['INATTENTIONIND'] == "Y" ].index
df.drop(indexNames_1 , inplace=True)

In [20]:
#Here we are dropping incidents caused by driving under the influence of alcohol or drugs as this will cause noise to the relation of accidents with weather
#Here we will drop incidents that have a missing information whether the driver was driving under the influence
indexNames_2 = df[ df['UNDERINFL'] == 1 ].index
df.drop(indexNames_2 , inplace=True)
df.dropna(subset=['UNDERINFL'], axis=0, inplace=True)

In [21]:
#Here we will drop incidents related to speeding as this will cause noise to the relation of accidents with weather
indexNames_3 = df[ df['SPEEDING'] == 'Y' ].index
df.drop(indexNames_3 , inplace=True)

In [22]:
#Here we will drop incidents that do not have WEATHER, ROADCOND, LIGHTCOND, ADDRTYPE or COLLISIONTYP indicated
df.dropna(subset=['WEATHER'], axis=0, inplace=True)
df.dropna(subset=['ROADCOND'], axis=0, inplace=True)
df.dropna(subset=['LIGHTCOND'], axis=0, inplace=True)
df.dropna(subset=["ADDRTYPE"], axis=0, inplace=True)
df.dropna(subset=["COLLISIONTYPE"], axis=0, inplace=True)

In [23]:
#Confirm that all the missing Data has been taken care of
missing_data=df.isnull()

#Get the count of missing values
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 

SEVERITYCODE
False    127359
Name: SEVERITYCODE, dtype: int64

PERSONCOUNT
False    127359
Name: PERSONCOUNT, dtype: int64

PEDCOUNT
False    127359
Name: PEDCOUNT, dtype: int64

PEDCYLCOUNT
False    127359
Name: PEDCYLCOUNT, dtype: int64

VEHCOUNT
False    127359
Name: VEHCOUNT, dtype: int64

INCDATES
False    127359
Name: INCDATES, dtype: int64

ADDRTYPE
False    127359
Name: ADDRTYPE, dtype: int64

COLLISIONTYPE
False    127359
Name: COLLISIONTYPE, dtype: int64

JUNCTIONTYPE
False    127359
Name: JUNCTIONTYPE, dtype: int64

INATTENTIONIND
True    127359
Name: INATTENTIONIND, dtype: int64

UNDERINFL
False    127359
Name: UNDERINFL, dtype: int64

WEATHER
False    127359
Name: WEATHER, dtype: int64

ROADCOND
False    127359
Name: ROADCOND, dtype: int64

LIGHTCOND
False    127359
Name: LIGHTCOND, dtype: int64

SPEEDING
True    127359
Name: SPEEDING, dtype: int64

INCMONTH
False    127359
Name: INCMONTH, dtype: int64

INCYEAR
False    127359
Name: INCYEAR, dtype: int64



In [24]:
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,INCMONTH,INCYEAR
0,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),,0,Overcast,Wet,Daylight,,3,2013
1,1,2,0,0,2,2,Block,Sideswipe,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,,12,2006
2,1,4,0,0,3,3,Block,Parked Car,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,,11,2004
3,1,3,0,0,3,4,Block,Other,Mid-Block (not related to intersection),,0,Clear,Dry,Daylight,,3,2013
4,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),,0,Raining,Wet,Daylight,,1,2004


### Dropping attributes that are not required

In [25]:
#We will not need the attributes INATTENTIONIND, UNDERINFL & SPEEDING. So we will drop them
df.drop(["INATTENTIONIND", "UNDERINFL","SPEEDING"], axis=1,inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCMONTH,INCYEAR
0,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),Overcast,Wet,Daylight,3,2013
1,1,2,0,0,2,2,Block,Sideswipe,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,12,2006
2,1,4,0,0,3,3,Block,Parked Car,Mid-Block (not related to intersection),Overcast,Dry,Daylight,11,2004
3,1,3,0,0,3,4,Block,Other,Mid-Block (not related to intersection),Clear,Dry,Daylight,3,2013
4,2,2,0,0,2,2,Intersection,Angles,At Intersection (intersection related),Raining,Wet,Daylight,1,2004


### Convert Categorical features to numerical value

In [26]:
#We will generate code identification for al the categorical attributes 
a = df.JUNCTIONTYPE.astype('category')

b = dict(enumerate(a.cat.categories))
print (b)

c = df.WEATHER.astype('category')

d = dict(enumerate(c.cat.categories))
print (d)

e = df.ROADCOND.astype('category')

f = dict(enumerate(e.cat.categories))
print (f)

g = df.LIGHTCOND.astype('category')

j = dict(enumerate(g.cat.categories))
print (j)

h = df.ADDRTYPE.astype('category')

k = dict(enumerate(h.cat.categories))
print (k)

l = df.COLLISIONTYPE.astype('category')

m = dict(enumerate(l.cat.categories))
print (m)


{0: 'At Intersection (but not related to intersection)', 1: 'At Intersection (intersection related)', 2: 'Driveway Junction', 3: 'Mid-Block (but intersection related)', 4: 'Mid-Block (not related to intersection)', 5: 'Ramp Junction'}
{0: 'Blowing Sand/Dirt', 1: 'Clear', 2: 'Fog/Smog/Smoke', 3: 'Other', 4: 'Overcast', 5: 'Partly Cloudy', 6: 'Raining', 7: 'Severe Crosswind', 8: 'Sleet/Hail/Freezing Rain', 9: 'Snowing'}
{0: 'Dry', 1: 'Ice', 2: 'Oil', 3: 'Other', 4: 'Sand/Mud/Dirt', 5: 'Snow/Slush', 6: 'Standing Water', 7: 'Wet'}
{0: 'Dark - No Street Lights', 1: 'Dark - Street Lights Off', 2: 'Dark - Street Lights On', 3: 'Dark - Unknown Lighting', 4: 'Dawn', 5: 'Daylight', 6: 'Dusk', 7: 'Other'}
{0: 'Alley', 1: 'Block', 2: 'Intersection'}
{0: 'Angles', 1: 'Cycles', 2: 'Head On', 3: 'Left Turn', 4: 'Other', 5: 'Parked Car', 6: 'Pedestrian', 7: 'Rear Ended', 8: 'Right Turn', 9: 'Sideswipe'}


In [27]:
df["JUNCTIONTYPE"] = df["JUNCTIONTYPE"].astype('category')
df["JUNCTIONTYPE"] = df["JUNCTIONTYPE"].cat.codes
df["WEATHER"] = df["WEATHER"].astype('category')
df["WEATHER"] = df["WEATHER"].cat.codes
df["ROADCOND"] = df["ROADCOND"].astype('category')
df["ROADCOND"] = df["ROADCOND"].cat.codes
df["LIGHTCOND"] = df["LIGHTCOND"].astype('category')
df["LIGHTCOND"] = df["LIGHTCOND"].cat.codes
df["ADDRTYPE"] = df["ADDRTYPE"].astype('category')
df["ADDRTYPE"] = df["ADDRTYPE"].cat.codes
df["COLLISIONTYPE"] = df["COLLISIONTYPE"].astype('category')
df["COLLISIONTYPE"] = df["COLLISIONTYPE"].cat.codes

In [28]:
df.tail(200)

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATES,ADDRTYPE,COLLISIONTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCMONTH,INCYEAR
194340,1,2,0,0,2,3,2,8,1,4,7,6,1,2019
194341,2,3,0,0,2,1,2,0,1,1,0,5,12,2018
194343,2,3,0,0,2,6,2,0,1,1,0,2,1,2019
194344,2,2,0,0,2,3,2,9,1,6,7,5,12,2018
194345,1,1,0,0,1,1,1,4,4,1,0,2,12,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194666,2,2,0,0,2,4,1,0,4,1,7,5,1,2019
194668,2,3,0,0,2,0,1,2,4,1,0,5,11,2018
194670,2,3,0,0,2,5,2,3,1,1,0,5,1,2019
194671,2,2,0,1,1,1,2,1,1,1,0,6,1,2019


### Ensure all attributes are of type "int"

In [29]:
df.dtypes

SEVERITYCODE      int64
PERSONCOUNT       int64
PEDCOUNT          int64
PEDCYLCOUNT       int64
VEHCOUNT          int64
INCDATES         object
ADDRTYPE           int8
COLLISIONTYPE      int8
JUNCTIONTYPE       int8
WEATHER            int8
ROADCOND           int8
LIGHTCOND          int8
INCMONTH         object
INCYEAR          object
dtype: object

In [30]:
df[["INCDATES"]] = df[["INCDATES"]].astype(int)
df[["INCMONTH"]] = df[["INCMONTH"]].astype(int)
df[["INCYEAR"]] = df[["INCYEAR"]].astype(int)

In [31]:
df.dtypes

SEVERITYCODE     int64
PERSONCOUNT      int64
PEDCOUNT         int64
PEDCYLCOUNT      int64
VEHCOUNT         int64
INCDATES         int64
ADDRTYPE          int8
COLLISIONTYPE     int8
JUNCTIONTYPE      int8
WEATHER           int8
ROADCOND          int8
LIGHTCOND         int8
INCMONTH         int64
INCYEAR          int64
dtype: object