# 2 Data wrangling<a id='2_Data_wrangling'></a>

## 2.2 Introduction<a id='2.2_Introduction'></a>

Chicago is the largest city in Illinois and the third-largest city in the United States. It is a major center for commerce, industry, transportation, and culture. The city has over 2.7 million residents, and every year many of them are affected by devastating traffic accidents. Chicago is working to improve its residents' quality of life through its Vision Zero initiative, which aims to prevent these tragedies. One of the city's main targets is to keep its roads safe by reducing speed-related fatal and serious-injury crashes by 25%.

### 2.2.1 Problem Statement<a id='2.2.1_Recap_Of_Data_Science_Problem'></a>

The purpose of this data science project is to build a traffic crash prediction model for the city of Chicago. The city recorded a 45% increase in traffic crash fatalities in 2020 compared to 2019. Per the city's report, these fatal accidents are attributed to an increase in speeding within the city. A similar trend was observed nationwide over the same period; however, the average death rate in Chicago was far worse.


This project aims to build a predictive model for fatal and serious-injury crashes based on a number of crash types (determining which ones resulted in fatal or serious injuries), broken down by neighborhood within the city.

This model will guide the City of Chicago in predicting whether a traffic crash will be severe or fatal, helping the city optimize the allocation of its emergency resources.


## 2.3 Imports<a id='2.3_Imports'></a>

In [3]:
#Import pandas, matplotlib.pyplot, and seaborn 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os


## 2.4 Objectives<a id='2.4_Objectives'></a>

In this section, we determine the required target value. To predict fatal and serious-injury crashes, we set the INJURIES_INCAPACITATING column as the target. This feature represents "total persons sustaining incapacitating/serious injuries in the crash as determined by the reporting officer. Any injury other than fatal injury, which prevents the injured person from walking, driving, or normally continuing the activities they were capable of performing before the injury occurred. Includes severe lacerations, broken limbs, skull or chest injuries, and abdominal injuries."

The rest of the columns are treated as potentially useful features.
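Since the stated goal is to predict whether a crash is severe, one possible way to turn the injury count into a modeling target is a binary flag. The sketch below uses a small made-up frame; the column name `SEVERE` and the choice to treat missing counts as "not severe" are assumptions for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crash data; the values are made up.
toy = pd.DataFrame({"INJURIES_INCAPACITATING": [0.0, 2.0, np.nan, 1.0]})

# Flag crashes with at least one incapacitating injury.
# A comparison with NaN evaluates to False, so missing counts become 0 here.
toy["SEVERE"] = (toy["INJURIES_INCAPACITATING"] > 0).astype(int)

print(toy["SEVERE"].tolist())
```

If missing counts should not be silently treated as non-severe, those rows would need to be dropped or handled before building the flag.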
 


## 2.5 Load Chicago Traffic Crashes Data<a id='2.5_Load_The_Ski_Resort_Data'></a>

Now we will check whether there are fundamental issues with the data.

In [4]:
# the supplied CSV data file is in the raw_data directory
traffic_crashes_data = pd.read_csv('../raw_data/traffic_crashes_data1.csv')



A summary of the columns is displayed with the `info` method in order to audit the data.

In [3]:
#Call the info method on traffic_crashes data to see a summary of the data
traffic_crashes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 692206 entries, 0 to 692205
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                692206 non-null  object 
 1   RD_NO                          688111 non-null  object 
 2   CRASH_DATE_EST_I               52490 non-null   object 
 3   CRASH_DATE                     692206 non-null  object 
 4   POSTED_SPEED_LIMIT             692206 non-null  int64  
 5   TRAFFIC_CONTROL_DEVICE         692206 non-null  object 
 6   DEVICE_CONDITION               692206 non-null  object 
 7   WEATHER_CONDITION              692206 non-null  object 
 8   LIGHTING_CONDITION             692206 non-null  object 
 9   FIRST_CRASH_TYPE               692206 non-null  object 
 10  TRAFFICWAY_TYPE                692206 non-null  object 
 11  LANE_CNT                       198997 non-null  float64
 12  ALIGNMENT                     

In [4]:
#inspect dimension of dataframe
traffic_crashes_data.shape


(692206, 49)

In [5]:
# print the first several rows of the data
traffic_crashes_data.head(10)

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,JC199149,,03/25/2019 02:43:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,...,0.0,1.0,2.0,0.0,14,2,3,41.884547,-87.641201,POINT (-87.64120093714 41.884547224337)
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,JB422857,,09/05/2018 08:40:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,...,0.0,0.0,2.0,0.0,8,4,9,41.968562,-87.740659,POINT (-87.740659314632 41.968562453871)
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,JF318029,,07/15/2022 12:45:00 AM,30,UNKNOWN,UNKNOWN,CLEAR,"DARKNESS, LIGHTED ROAD",ANGLE,...,0.0,0.0,2.0,0.0,0,6,7,41.886336,-87.716203,POINT (-87.716203130599 41.886336409761)
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,JF378711,,08/29/2022 11:30:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,...,0.0,0.0,3.0,0.0,11,2,8,41.749348,-87.721097,POINT (-87.721096727406 41.749348170421)
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,JF324552,,07/15/2022 06:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,...,0.0,0.0,2.0,0.0,18,6,7,41.925111,-87.667997,POINT (-87.667997321599 41.925110815832)
5,78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b...,JB291672,,06/03/2018 05:00:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,UNKNOWN,PARKED MOTOR VEHICLE,...,0.0,0.0,1.0,0.0,17,1,6,41.910758,-87.731389,POINT (-87.731388754145 41.910757551599)
6,7943cacbae1bb60e0f056bf53bdaadc0d6092000c19167...,JF330061,,07/24/2022 07:23:00 PM,25,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,...,0.0,0.0,1.0,0.0,19,1,7,41.782639,-87.694284,POINT (-87.694283539975 41.782638841451)
7,01aaa759c6bbefd0f584226fbd88bdc549de3ed1e46255...,JF319819,,07/15/2022 05:10:00 PM,40,NO CONTROLS,NO CONTROLS,CLOUDY/OVERCAST,DAYLIGHT,ANGLE,...,0.0,0.0,2.0,0.0,17,6,7,41.975826,-87.65042,POINT (-87.650419778017 41.975826016449)
8,7b1537e0a3e166f7542afe24eefca6fcff71061433323d...,JA252488,,05/05/2017 01:00:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,SIDESWIPE OPPOSITE DIRECTION,...,0.0,0.0,2.0,0.0,13,6,5,41.946332,-87.739157,POINT (-87.739156849826 41.946332274015)
9,011f2a8e3d1943e19d92862ab066becc8bcedc8e084b0a...,JF408563,Y,09/23/2022 08:00:00 PM,24,NO CONTROLS,NO CONTROLS,UNKNOWN,UNKNOWN,PARKED MOTOR VEHICLE,...,0.0,0.0,1.0,0.0,20,6,9,41.844149,-87.712489,POINT (-87.71248896045 41.844149372172)


The output above shows plausible column headings. We can already see missing values in the `CRASH_DATE_EST_I` column.

## 2.6 Explore The Data<a id='2.6_Explore_The_Data'></a>

### 2.6.1 Target Column<a id='2.6.1_Target_Column'></a>

The column of interest is INJURIES_INCAPACITATING. We inspect the crashes with at least one incapacitating injury and check for missing values:

In [6]:
#Filter the traffic_crashes_data dataframe to display just the rows for the crashes with incapacitating injuries
#The transpose of the row will give a better output
traffic_crashes_data[traffic_crashes_data.INJURIES_INCAPACITATING > 0].T

Unnamed: 0,85,116,207,413,617,652,728,795,814,881,...,691766,691796,691855,691857,691866,692005,692055,692135,692150,692174
CRASH_RECORD_ID,7afafeab6835a895e20dbfe7747c0cd81d0a7afd94cd2c...,332e7008a34f5fb711b3cea3bceeb98f0d8ec8d698540d...,7a6121441a3a8d441e9788c3db14750ee8cbe321318ef7...,639a113b84e70a07e4e52be54e9c0770b9d1fd03dc1d4e...,d43aa82641db9ddf3072cac967d377ee09f395a32cae2b...,7bcc1fffb568845967c65e32de8ea2f6d9c7ac7b108e30...,c45fda8e86b2c5cc73eefd2515d9ef23935b0caf496d3c...,790748a25825884c14d17cbbfb24c41fdf99eb9a7854fa...,7c38c494e3ce49ed072e9479c2f78bef5d22b679281055...,79628ebfa0725dab6c980ca74898dc7ee7a76c6b752667...,...,27ff457f4c03f9e6f48e2548d6bd47c0a664c1139c6537...,16ea556aa2bfdc9b1ac8ea58d81351d46032e64101cd6e...,5b83a89db9721a2966cf0e8cfbc42142eb8174f47d2c2e...,d3711ab518e57f50c5566073be6600e634a3f3001ff718...,22632e2e4391814c2f355404bebd863d5da001bf6de034...,1244bf96aba32ef310a05afd0a090511db5b27badae722...,c1958a5cb87552b03c0bbf2f8d8b78707cad5237bd31cc...,6b6f5ceb4053bfbb3483fb453231caa94ff2351bde4c9d...,89ede1c084f73bcda4ceb5cba3218cec65736af207e9a1...,1b2d8d20409cbaaad21ab689a6c73e5914f8e53605f77b...
RD_NO,JC319055,JF375267,JB464959,JF318418,JF375571,JD275499,JF318066,JD233450,JA517553,JE369874,...,JF455134,JF455834,JF454664,JF454582,JC475340,JF456236,JF456579,JC483473,JF485926,JF390161
CRASH_DATE_EST_I,,,,,,,,Y,,,...,,,,,,,,,,
CRASH_DATE,06/23/2019 09:45:00 PM,08/29/2022 04:30:00 PM,10/06/2018 02:29:00 PM,07/15/2022 11:10:00 AM,08/29/2022 11:50:00 PM,06/25/2020 07:21:00 AM,07/15/2022 01:17:00 AM,05/15/2020 07:20:00 PM,11/19/2017 05:37:00 AM,09/11/2021 03:50:00 PM,...,10/30/2022 10:19:00 AM,10/30/2022 09:25:00 PM,10/29/2022 10:24:00 PM,10/29/2022 08:26:00 PM,10/16/2019 06:45:00 PM,10/31/2022 08:57:00 AM,10/02/2022 04:43:00 AM,10/23/2019 01:32:00 PM,11/24/2022 01:45:00 AM,09/10/2022 01:18:00 AM
POSTED_SPEED_LIMIT,30,30,30,25,30,25,30,30,15,30,...,35,30,30,30,25,30,15,30,35,30
TRAFFIC_CONTROL_DEVICE,STOP SIGN/FLASHER,NO CONTROLS,NO CONTROLS,PEDESTRIAN CROSSING SIGN,TRAFFIC SIGNAL,STOP SIGN/FLASHER,NO CONTROLS,NO CONTROLS,STOP SIGN/FLASHER,TRAFFIC SIGNAL,...,TRAFFIC SIGNAL,TRAFFIC SIGNAL,NO CONTROLS,TRAFFIC SIGNAL,NO CONTROLS,TRAFFIC SIGNAL,NO CONTROLS,TRAFFIC SIGNAL,NO CONTROLS,FLASHING CONTROL SIGNAL
DEVICE_CONDITION,FUNCTIONING PROPERLY,NO CONTROLS,NO CONTROLS,FUNCTIONING PROPERLY,FUNCTIONING PROPERLY,NO CONTROLS,NO CONTROLS,NO CONTROLS,FUNCTIONING PROPERLY,FUNCTIONING IMPROPERLY,...,FUNCTIONING PROPERLY,FUNCTIONING PROPERLY,NO CONTROLS,FUNCTIONING PROPERLY,NO CONTROLS,FUNCTIONING PROPERLY,NO CONTROLS,FUNCTIONING PROPERLY,NO CONTROLS,FUNCTIONING PROPERLY
WEATHER_CONDITION,CLOUDY/OVERCAST,CLEAR,CLOUDY/OVERCAST,RAIN,CLEAR,CLEAR,CLEAR,CLEAR,CLEAR,CLEAR,...,CLEAR,CLOUDY/OVERCAST,CLEAR,CLEAR,CLEAR,CLOUDY/OVERCAST,UNKNOWN,CLEAR,CLEAR,CLEAR
LIGHTING_CONDITION,DARKNESS,DAYLIGHT,DAYLIGHT,DAYLIGHT,DARKNESS,DAYLIGHT,"DARKNESS, LIGHTED ROAD",DAWN,"DARKNESS, LIGHTED ROAD",DAYLIGHT,...,DAYLIGHT,"DARKNESS, LIGHTED ROAD","DARKNESS, LIGHTED ROAD","DARKNESS, LIGHTED ROAD",DAWN,DAYLIGHT,UNKNOWN,DAYLIGHT,"DARKNESS, LIGHTED ROAD","DARKNESS, LIGHTED ROAD"
FIRST_CRASH_TYPE,PEDESTRIAN,REAR END,PEDALCYCLIST,PEDESTRIAN,PARKED MOTOR VEHICLE,ANGLE,PEDESTRIAN,PEDESTRIAN,SIDESWIPE OPPOSITE DIRECTION,TURNING,...,ANGLE,ANGLE,HEAD ON,PEDESTRIAN,PEDALCYCLIST,PEDESTRIAN,PEDESTRIAN,PEDESTRIAN,OTHER OBJECT,PEDESTRIAN


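The column INJURIES_INCAPACITATING does not appear to have any missing values in the rows shown. Note that a `> 0` filter excludes NaN by construction, so this view alone cannot rule out missing values elsewhere in the column.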
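Missing values in the target are easier to confirm by counting them directly than by reading a filtered view. A minimal sketch on a small made-up frame (the `toy` data is illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for traffic_crashes_data; the values are made up.
toy = pd.DataFrame({"INJURIES_INCAPACITATING": [0.0, 1.0, np.nan, 2.0, 0.0]})

# A comparison with NaN is False, so this filter never returns missing rows.
severe = toy[toy["INJURIES_INCAPACITATING"] > 0]

# Count missing values in the target explicitly.
n_missing = toy["INJURIES_INCAPACITATING"].isnull().sum()

print(len(severe), n_missing)
```

The same `isnull().sum()` call on the real column would give the number of crashes with no recorded value.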

### 2.6.2 Number Of Missing Values By Column<a id='2.6.2_Number_Of_Missing_Values_By_Column'></a>

Now we determine which columns have the most missing values. 
Method used: Count the number of missing values in each column and sort them.

In [8]:
missing = pd.concat([traffic_crashes_data.isnull().sum(), 100 * traffic_crashes_data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)

Unnamed: 0,count,%
WORKERS_PRESENT_I,691158,99.8486
DOORING_I,690078,99.692577
WORK_ZONE_TYPE,689037,99.542188
WORK_ZONE_I,688161,99.415636
PHOTOS_TAKEN_I,683626,98.760485
STATEMENTS_TAKEN_I,677725,97.907993
NOT_RIGHT_OF_WAY_I,659749,95.311078
CRASH_DATE_EST_I,639716,92.416997
INTERSECTION_RELATED_I,533539,77.078066
LANE_CNT,493209,71.251766


WORKERS_PRESENT_I, DOORING_I, WORK_ZONE_TYPE, and WORK_ZONE_I have the most missing values, at over 99%, followed by PHOTOS_TAKEN_I, STATEMENTS_TAKEN_I, NOT_RIGHT_OF_WAY_I, and CRASH_DATE_EST_I, with missing rates between 92% and 99%. The columns INTERSECTION_RELATED_I, LANE_CNT, and HIT_AND_RUN_I have missing rates of roughly 70%.
The dataset contains 692,206 rows and 49 columns. Based on this investigation, several columns have more than 60% missing values. It is not feasible to impute missing values for those columns, so we will remove them.

In [9]:
def removeNulls(dataframe, axis=1, percent=0.6):
    '''
    * removeNulls drops the rows or columns whose share of null values
      is greater than or equal to `percent`.
    * dataframe : the DataFrame to clean (it is copied, not modified)
    * axis      : axis=0 drops rows, axis=1 (default) drops columns
    * percent   : null fraction at or above which a row/column is dropped,
                  default is 0.6 (60%)
    '''
    df = dataframe.copy()
    ishape = df.shape
    if axis == 0:
        # Fraction of null values in each row (relative to the number of columns)
        null_share = df.isnull().mean(axis=1)
        rownames = list(null_share[null_share >= percent].index)
        df.drop(index=rownames, inplace=True)
        print("\nNumber of Rows removed\t: ", len(rownames))
    else:
        # Fraction of null values in each column (relative to the number of rows)
        null_share = df.isnull().mean(axis=0)
        colnames = list(null_share[null_share >= percent].index)
        df.drop(columns=colnames, inplace=True)
        print("Number of Columns removed\t: ", len(colnames))

    print("\nRaw dataset rows,columns", ishape, "\nClean dataset rows,columns", df.shape)

    return df



In [10]:
# Remove columns where NA values are more than or equal to 60%
traffic_crashes_data1 = removeNulls(traffic_crashes_data, axis =1,percent = 0.6)


Number of Columns removed	:  11

Raw dataset rows,columns (692206, 49) 
Clean dataset rows,columns (692206, 38)


In [11]:
# Remove rows where NA values are more than or equal to 60%
traffic_crashes_data2 = removeNulls(traffic_crashes_data1, axis =0,percent = 0.6)


Number of Rows removed	:  0

Raw dataset rows,columns (692206, 38) 
Clean dataset rows,columns (692206, 38)
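The same thresholds can also be expressed with boolean masks on the null fraction, without an explicit `drop`. This is a sketch on toy data rather than a replacement for `removeNulls`; the function name `drop_sparse` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, percent=0.6):
    """Drop columns, then rows, whose null fraction is >= percent."""
    keep_cols = df.isnull().mean(axis=0) < percent
    out = df.loc[:, keep_cols]
    keep_rows = out.isnull().mean(axis=1) < percent
    return out.loc[keep_rows]

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% null, dropped
    "c": [np.nan, 1.0, 2.0, 3.0, 4.0],           # 20% null, kept
})

clean = drop_sparse(toy)
print(clean.shape)
```

Row fractions are computed after the sparse columns are removed, which mirrors the order of the two calls above.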


In [14]:
#keep only the first half of the rows (by position) because the full dataset was too large
traffic_crashes_data3=traffic_crashes_data2.drop(traffic_crashes_data2.index[346103:692206])

In [15]:
traffic_crashes_data3.shape

(346103, 38)
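Dropping the second half by position keeps whatever order the file happens to have. If the rows are ordered in some way (for example by date), a reproducible random sample may be more representative. The following is an alternative sketch on toy data, not what the notebook does:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned crash data.
toy = pd.DataFrame({"x": range(10)})

# Keep a reproducible random 50% of the rows instead of the first half.
half = toy.sample(frac=0.5, random_state=42)

print(half.shape)
```

Fixing `random_state` makes the subset repeatable across notebook runs.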

### 2.6.3 Categorical Features<a id='2.6.3_Categorical_Features'></a>

So far we have examined missing values across all columns. Now we will inspect the categorical features, i.e., examine each column from the solution/outcome perspective to determine whether it is required for the analysis. In this section we will focus on removing irrelevant columns.

In [17]:
#Use traffic_crashes_data `select_dtypes` method to select columns of dtype 'object'
traffic_crashes_data3.select_dtypes('object')

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,ALIGNMENT,...,REPORT_TYPE,CRASH_TYPE,DAMAGE,DATE_POLICE_NOTIFIED,PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,STREET_DIRECTION,STREET_NAME,MOST_SEVERE_INJURY,LOCATION
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,JC199149,03/25/2019 02:43:00 PM,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,ONE-WAY,STRAIGHT AND LEVEL,...,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",03/25/2019 03:17:00 PM,IMPROPER TURNING/NO SIGNAL,DRIVING SKILLS/KNOWLEDGE/EXPERIENCE,W,RANDOLPH ST,"REPORTED, NOT EVIDENT",POINT (-87.64120093714 41.884547224337)
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,JB422857,09/05/2018 08:40:00 AM,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",09/05/2018 09:00:00 AM,"VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS,...",FAILING TO YIELD RIGHT-OF-WAY,N,ELSTON AVE,NO INDICATION OF INJURY,POINT (-87.740659314632 41.968562453871)
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,JF318029,07/15/2022 12:45:00 AM,UNKNOWN,UNKNOWN,CLEAR,"DARKNESS, LIGHTED ROAD",ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",07/15/2022 12:50:00 AM,UNABLE TO DETERMINE,UNABLE TO DETERMINE,N,CENTRAL PARK AVE,NO INDICATION OF INJURY,POINT (-87.716203130599 41.886336409761)
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,JF378711,08/29/2022 11:30:00 AM,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,...,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",09/01/2022 11:30:00 AM,DISREGARDING TRAFFIC SIGNALS,NOT APPLICABLE,W,79TH ST,NO INDICATION OF INJURY,POINT (-87.721096727406 41.749348170421)
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,JF324552,07/15/2022 06:50:00 PM,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,...,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"OVER $1,500",07/20/2022 11:00:00 AM,UNABLE TO DETERMINE,UNABLE TO DETERMINE,N,ASHLAND AVE,NO INDICATION OF INJURY,POINT (-87.667997321599 41.925110815832)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346098,6884687565db1423ba8db471bd87cca37fb822af838056...,JB364956,07/25/2018 04:30:00 PM,UNKNOWN,UNKNOWN,CLEAR,DAYLIGHT,REAR END,ONE-WAY,STRAIGHT AND LEVEL,...,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,"$501 - $1,500",07/25/2018 05:19:00 PM,FOLLOWING TOO CLOSELY,NOT APPLICABLE,W,WELLINGTON AVE,NO INDICATION OF INJURY,POINT (-87.756681654867 41.935159892552)
346099,6a6a92c6a6b60dc55031cdf0775357ff48c88065d3ffb6...,JD462068,12/17/2020 09:11:00 PM,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN BARRIER,STRAIGHT AND LEVEL,...,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",12/17/2020 09:25:00 PM,UNABLE TO DETERMINE,UNABLE TO DETERMINE,N,ASHLAND AVE,NONINCAPACITATING INJURY,POINT (-87.669023859698 41.958607647297)
346100,69317f025f70f8a764ca47fc3d2917df8311ce8c327959...,JC113910,01/11/2019 10:00:00 PM,NO CONTROLS,NO CONTROLS,UNKNOWN,UNKNOWN,PARKED MOTOR VEHICLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,ON SCENE,NO INJURY / DRIVE AWAY,"OVER $1,500",01/12/2019 10:01:00 AM,UNABLE TO DETERMINE,NOT APPLICABLE,W,51ST ST,NO INDICATION OF INJURY,POINT (-87.698819342995 41.800795760859)
346101,687861c5d61e8f2ae6d4a3d881b8f6396b99b92cb29e79...,JD386991,10/03/2020 02:10:00 AM,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,...,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",10/03/2020 02:25:00 AM,IMPROPER TURNING/NO SIGNAL,DRIVING SKILLS/KNOWLEDGE/EXPERIENCE,S,STATE ST,NONINCAPACITATING INJURY,POINT (-87.624869685896 41.756400406673)


In [18]:
#looking into LIGHTING_CONDITION because 'DARKNESS, LIGHTED ROAD' appears as a value
lighting_condition_stat=traffic_crashes_data3.groupby('LIGHTING_CONDITION') ['LIGHTING_CONDITION'].agg('count').sort_values(ascending=False)
lighting_condition_stat

LIGHTING_CONDITION
DAYLIGHT                  224678
DARKNESS, LIGHTED ROAD     75162
DARKNESS                   16399
UNKNOWN                    14058
DUSK                        9935
DAWN                        5871
Name: LIGHTING_CONDITION, dtype: int64

The column LIGHTING_CONDITION describes the lighting condition at the time of the crash. The column contains over 75,000 rows with the value "DARKNESS, LIGHTED ROAD" and over 16,000 rows with the value "DARKNESS". To make the difference between the two values clearer, we will change "DARKNESS, LIGHTED ROAD" to "POORLY LIT".

In [19]:
traffic_crashes_data3['LIGHTING_CONDITION'] = traffic_crashes_data3['LIGHTING_CONDITION'] .replace('DARKNESS, LIGHTED ROAD','POORLY LIT')
traffic_crashes_data3.head()

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,JC199149,03/25/2019 02:43:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,ONE-WAY,...,0.0,1.0,2.0,0.0,14,2,3,41.884547,-87.641201,POINT (-87.64120093714 41.884547224337)
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,JB422857,09/05/2018 08:40:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,8,4,9,41.968562,-87.740659,POINT (-87.740659314632 41.968562453871)
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,JF318029,07/15/2022 12:45:00 AM,30,UNKNOWN,UNKNOWN,CLEAR,POORLY LIT,ANGLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,0,6,7,41.886336,-87.716203,POINT (-87.716203130599 41.886336409761)
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,JF378711,08/29/2022 11:30:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,...,0.0,0.0,3.0,0.0,11,2,8,41.749348,-87.721097,POINT (-87.721096727406 41.749348170421)
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,JF324552,07/15/2022 06:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,...,0.0,0.0,2.0,0.0,18,6,7,41.925111,-87.667997,POINT (-87.667997321599 41.925110815832)


In [22]:
#looking into Crash_type
crash_type_stat=traffic_crashes_data3.groupby('CRASH_TYPE') ['CRASH_TYPE'].agg('count').sort_values(ascending=False)
crash_type_stat



CRASH_TYPE
NO INJURY / DRIVE AWAY              255334
INJURY AND / OR TOW DUE TO CRASH     90769
Name: CRASH_TYPE, dtype: int64
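The `groupby(...)[...].agg('count')` pattern used for these frequency tables can be written more compactly with `value_counts`, which sorts by frequency by default. Passing `dropna=False` also makes missing values visible, which `count` silently skips. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CRASH_TYPE column; the values are made up.
toy = pd.DataFrame({
    "CRASH_TYPE": [
        "NO INJURY / DRIVE AWAY",
        "INJURY AND / OR TOW DUE TO CRASH",
        "NO INJURY / DRIVE AWAY",
        np.nan,
        "NO INJURY / DRIVE AWAY",
    ]
})

# Frequencies in descending order, with NaN counted as its own category.
counts = toy["CRASH_TYPE"].value_counts(dropna=False)
print(counts)
```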

#### 2.6.3.1 Unique  Names<a id='2.6.3.1_Unique_Names'></a>

In [27]:
#Use pandas' Series method `value_counts` to find any duplicated crash records
traffic_crashes_data3['CRASH_RECORD_ID'].value_counts().head()

79c7a2ce89f446262efd86df3d72d18b04ba487024b7c42d58be7bc0ee3b2779be1916679231382b4a4bfe14200bd305d9c6feb7cd70839f863dd944b040212d    1
39506a83c52d6812f105eac689571207620ba8f9f10c4b010f6643e68fecfb881e8a79b1eff86aac15d920b810c3e917a832616e2db74c2544566f0e75cf2106    1
3c63602c082a9217a06843afa2d948e70b436d98a47915882895dafbb85be2b9af31864f7b9881f56dc52c7fa9454a51331473bd3728fa444dd5229a4a52c03f    1
3c8c9be2edac9ed96bc564c750b64b533e756f8f09aec13d15b34a3205951010bad50bfe93c324235ceec001bd1805f531d6a9c9cddd7319120d40b645b7c17f    1
3c1e0264962a5d9fef9e8a20c2690ff2b9c7400e866a5970c55c29d8c63fe1b410828ca9cfd366aed1ac11d6663459c6d4a2be6b318bf7ecf1b36170b4b0d687    1
Name: CRASH_RECORD_ID, dtype: int64

CRASH_RECORD_ID serves as a unique ID. Because `value_counts` sorts in descending order and the highest count is 1, there are no duplicate records.
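A more direct uniqueness check is available through `Series.is_unique` and `Series.duplicated`. The sketch below uses made-up IDs, one of which repeats on purpose:

```python
import pandas as pd

# Hypothetical IDs; "b2" is repeated to show what a duplicate looks like.
toy = pd.DataFrame({"CRASH_RECORD_ID": ["a1", "b2", "c3", "b2"]})
ids = toy["CRASH_RECORD_ID"]

has_unique_ids = ids.is_unique        # False as soon as any ID repeats
n_repeats = ids.duplicated().sum()    # occurrences beyond the first of each ID

print(has_unique_ids, n_repeats)
```

On the real data, `is_unique` returning `True` would confirm the conclusion drawn from `value_counts`.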

#### 2.6.3.2 Remove irrelevant features<a id='2.6.3.2_Remove_Irrelevant_Features'></a>

In [28]:
traffic_crashes_data3.head(5)

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,JC199149,03/25/2019 02:43:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,ONE-WAY,...,0.0,1.0,2.0,0.0,14,2,3,41.884547,-87.641201,POINT (-87.64120093714 41.884547224337)
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,JB422857,09/05/2018 08:40:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,8,4,9,41.968562,-87.740659,POINT (-87.740659314632 41.968562453871)
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,JF318029,07/15/2022 12:45:00 AM,30,UNKNOWN,UNKNOWN,CLEAR,POORLY LIT,ANGLE,NOT DIVIDED,...,0.0,0.0,2.0,0.0,0,6,7,41.886336,-87.716203,POINT (-87.716203130599 41.886336409761)
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,JF378711,08/29/2022 11:30:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,...,0.0,0.0,3.0,0.0,11,2,8,41.749348,-87.721097,POINT (-87.721096727406 41.749348170421)
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,JF324552,07/15/2022 06:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,...,0.0,0.0,2.0,0.0,18,6,7,41.925111,-87.667997,POINT (-87.667997321599 41.925110815832)


In [33]:
speed_limit_stat=traffic_crashes_data3.groupby('POSTED_SPEED_LIMIT') ['POSTED_SPEED_LIMIT'].agg('count').sort_values(ascending=False)
speed_limit_stat

POSTED_SPEED_LIMIT
30    254323
35     23319
25     21720
20     14266
15     12341
10      7907
0       3769
40      3320
45      2210
5       2201
55       316
3         91
50        78
9         41
99        33
39        31
1         20
24        19
60        18
2         10
65        10
32         9
11         8
34         8
33         7
7          4
12         3
36         3
6          3
26         2
14         2
22         2
70         1
63         1
62         1
38         1
4          1
49         1
23         1
31         1
29         1
Name: POSTED_SPEED_LIMIT, dtype: int64

In [34]:
traffic_crashes_data3.groupby('TRAFFIC_CONTROL_DEVICE') ['TRAFFIC_CONTROL_DEVICE'].agg('count')

TRAFFIC_CONTROL_DEVICE
BICYCLE CROSSING SIGN           10
DELINEATORS                    142
FLASHING CONTROL SIGNAL        117
LANE USE MARKING               644
NO CONTROLS                 199419
NO PASSING                      19
OTHER                         2182
OTHER RAILROAD CROSSING         79
OTHER REG. SIGN                369
PEDESTRIAN CROSSING SIGN       184
POLICE/FLAGMAN                 130
RAILROAD CROSSING GATE         224
RR CROSSING SIGN                46
SCHOOL ZONE                    137
STOP SIGN/FLASHER            34105
TRAFFIC SIGNAL               95527
UNKNOWN                      11978
YIELD                          502
Name: TRAFFIC_CONTROL_DEVICE, dtype: int64

In [35]:
traffic_crashes_data3.groupby('DEVICE_CONDITION') ['DEVICE_CONDITION'].agg('count')

DEVICE_CONDITION
FUNCTIONING IMPROPERLY        1654
FUNCTIONING PROPERLY        118678
MISSING                         35
NO CONTROLS                 201590
NOT FUNCTIONING               1104
OTHER                         2576
UNKNOWN                      20323
WORN REFLECTIVE MATERIAL       143
Name: DEVICE_CONDITION, dtype: int64

Both TRAFFIC_CONTROL_DEVICE and DEVICE_CONDITION contain values recorded as UNKNOWN or MISSING that cannot be reliably replaced.
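Because these placeholder labels are ordinary strings, `isnull()` does not count them. Their share per column can be measured with `isin`. In the sketch below the data is made up, and treating both `UNKNOWN` and `MISSING` as placeholders is an assumption (in DEVICE_CONDITION, `MISSING` may describe the device itself rather than an absent record).

```python
import pandas as pd

# Hypothetical stand-in for the two columns; the values are made up.
toy = pd.DataFrame({
    "TRAFFIC_CONTROL_DEVICE": ["NO CONTROLS", "UNKNOWN", "TRAFFIC SIGNAL", "UNKNOWN"],
    "DEVICE_CONDITION": ["NO CONTROLS", "UNKNOWN", "FUNCTIONING PROPERLY", "MISSING"],
})

# Labels assumed to stand for "no usable information".
placeholders = ["UNKNOWN", "MISSING"]

# Fraction of rows per column that carry a placeholder label.
share = toy.isin(placeholders).mean()
print(share)
```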

In [36]:
traffic_crashes_data3.isnull().sum()

CRASH_RECORD_ID                     0
RD_NO                               0
CRASH_DATE                          0
POSTED_SPEED_LIMIT                  0
TRAFFIC_CONTROL_DEVICE              0
DEVICE_CONDITION                    0
WEATHER_CONDITION                   0
LIGHTING_CONDITION                  0
FIRST_CRASH_TYPE                    0
TRAFFICWAY_TYPE                     0
ALIGNMENT                           0
ROADWAY_SURFACE_COND                0
ROAD_DEFECT                         0
REPORT_TYPE                      8846
CRASH_TYPE                          0
DAMAGE                              0
DATE_POLICE_NOTIFIED                0
PRIM_CONTRIBUTORY_CAUSE             0
SEC_CONTRIBUTORY_CAUSE              0
STREET_NO                           0
STREET_DIRECTION                    2
STREET_NAME                         0
BEAT_OF_OCCURRENCE                  3
NUM_UNITS                           0
MOST_SEVERE_INJURY                761
INJURIES_TOTAL                    754
INJURIES_FAT

RD_NO (the Chicago Police Department report number) is not relevant for this project, so this column will be dropped.

In [37]:
irrelevant_columns = ['RD_NO']
traffic_crashes_data3.drop(labels = irrelevant_columns, axis =1, inplace=True)


In [38]:
traffic_crashes_data3.head()

Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,ALIGNMENT,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,03/25/2019 02:43:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,ONE-WAY,STRAIGHT AND LEVEL,...,0.0,1.0,2.0,0.0,14,2,3,41.884547,-87.641201,POINT (-87.64120093714 41.884547224337)
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,09/05/2018 08:40:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,0.0,2.0,0.0,8,4,9,41.968562,-87.740659,POINT (-87.740659314632 41.968562453871)
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,07/15/2022 12:45:00 AM,30,UNKNOWN,UNKNOWN,CLEAR,POORLY LIT,ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,0.0,2.0,0.0,0,6,7,41.886336,-87.716203,POINT (-87.716203130599 41.886336409761)
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,08/29/2022 11:30:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,...,0.0,0.0,3.0,0.0,11,2,8,41.749348,-87.721097,POINT (-87.721096727406 41.749348170421)
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,07/15/2022 06:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,0.0,2.0,0.0,18,6,7,41.925111,-87.667997,POINT (-87.667997321599 41.925110815832)


Looking at the dataframe above, it looks like the columns "INJURIES_NON_INCAPACITATING"
and "INJURIES_REPORTED_NOT_EVIDENT" contain almost only zero values in the displayed rows.
Next, we will examine both columns closely.

In [40]:
traffic_crashes_data3.groupby('INJURIES_NON_INCAPACITATING') ['INJURIES_NON_INCAPACITATING'].agg('count')

INJURIES_NON_INCAPACITATING
0.0     317979
1.0      21462
2.0       4044
3.0       1210
4.0        415
5.0        148
6.0         56
7.0         17
8.0         10
9.0          2
10.0         1
12.0         1
14.0         1
18.0         1
19.0         1
21.0         1
Name: INJURIES_NON_INCAPACITATING, dtype: int64

In [41]:
traffic_crashes_data3.groupby('INJURIES_REPORTED_NOT_EVIDENT') ['INJURIES_REPORTED_NOT_EVIDENT'].agg('count')

INJURIES_REPORTED_NOT_EVIDENT
0.0     329687
1.0      12205
2.0       2487
3.0        650
4.0        202
5.0         78
6.0         19
7.0          8
8.0          5
9.0          4
10.0         3
15.0         1
Name: INJURIES_REPORTED_NOT_EVIDENT, dtype: int64
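A `head()` display covers only a few rows, so a quick summary is a more reliable way to tell whether a column is entirely null or entirely zero. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one injury-count column; the values are made up.
toy = pd.DataFrame({"INJURIES_NON_INCAPACITATING": [0.0, 0.0, 1.0, np.nan, 0.0]})
col = toy["INJURIES_NON_INCAPACITATING"]

all_null = col.isnull().all()   # True only if every value is missing
zero_share = (col == 0).mean()  # fraction of rows that are exactly zero

print(all_null, zero_share)
```

A `zero_share` close to, but below, 1.0 is consistent with the counts above: most crashes record zero injuries of a given kind, yet the column still carries information.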

Now, we will examine the column "ROADWAY_SURFACE_COND".
This is a relevant feature for the crash prediction model, so we have to make sure the
column contains relevant information.

In [45]:
traffic_crashes_data3['ROADWAY_SURFACE_COND'].unique()

array(['DRY', 'UNKNOWN', 'SNOW OR SLUSH', 'WET', 'ICE', 'OTHER',
       'SAND, MUD, DIRT'], dtype=object)

In [46]:
traffic_crashes_data3.groupby('ROADWAY_SURFACE_COND') ['ROADWAY_SURFACE_COND'].agg('count')

ROADWAY_SURFACE_COND
DRY                258786
ICE                  2475
OTHER                 815
SAND, MUD, DIRT       123
SNOW OR SLUSH       12637
UNKNOWN             25863
WET                 45404
Name: ROADWAY_SURFACE_COND, dtype: int64

In [47]:
traffic_crashes_data3['CRASH_DATE'].unique()

array(['03/25/2019 02:43:00 PM', '09/05/2018 08:40:00 AM',
       '07/15/2022 12:45:00 AM', ..., '12/17/2020 09:11:00 PM',
       '10/03/2020 02:10:00 AM', '12/06/2015 02:30:00 AM'], dtype=object)

In [48]:
#extract the year from the CRASH_DATE column into a new CRASH_YEAR column
traffic_crashes_data3['CRASH_YEAR'] = traffic_crashes_data3['CRASH_DATE'].apply(lambda x: x[6:10]) 
traffic_crashes_data3.head()



Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,ALIGNMENT,...,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION,CRASH_YEAR
0,79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4...,03/25/2019 02:43:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,ONE-WAY,STRAIGHT AND LEVEL,...,1.0,2.0,0.0,14,2,3,41.884547,-87.641201,POINT (-87.64120093714 41.884547224337),2019
1,792b539deaaad65ee5b4a9691d927a34d298eb33d42af0...,09/05/2018 08:40:00 AM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,2.0,0.0,8,4,9,41.968562,-87.740659,POINT (-87.740659314632 41.968562453871),2018
2,0115ade9a755e835255508463f7e9c4a9a0b47e9304238...,07/15/2022 12:45:00 AM,30,UNKNOWN,UNKNOWN,CLEAR,POORLY LIT,ANGLE,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,2.0,0.0,0,6,7,41.886336,-87.716203,POINT (-87.716203130599 41.886336409761),2022
3,05b1982cdba5d8a00e7e76ad1ecdab0e598429f78481d2...,08/29/2022 11:30:00 AM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,FOUR WAY,STRAIGHT AND LEVEL,...,0.0,3.0,0.0,11,2,8,41.749348,-87.721097,POINT (-87.721096727406 41.749348170421),2022
4,017040c61958d2fa977c956b2bd2d6759ef7754496dc96...,07/15/2022 06:50:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,STRAIGHT AND LEVEL,...,0.0,2.0,0.0,18,6,7,41.925111,-87.667997,POINT (-87.667997321599 41.925110815832),2022
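Slicing characters 6 to 10 works because every timestamp follows the same `MM/DD/YYYY hh:mm:ss AM/PM` layout, but it yields the year as a string. Parsing the column with `pd.to_datetime` is an alternative that gives an integer year and validates the format along the way. A sketch using two timestamps copied from the output above:

```python
import pandas as pd

toy = pd.DataFrame({
    "CRASH_DATE": ["03/25/2019 02:43:00 PM", "07/15/2022 12:45:00 AM"]
})

# Parse with an explicit format; %I is the 12-hour clock and %p is AM/PM.
parsed = pd.to_datetime(toy["CRASH_DATE"], format="%m/%d/%Y %I:%M:%S %p")

# Integer year taken from the parsed timestamps.
toy["CRASH_YEAR"] = parsed.dt.year

print(toy["CRASH_YEAR"].tolist())
```

With an explicit `format`, a timestamp that does not match the layout raises an error instead of producing a wrong year.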


In [53]:
# save the cleaned data to a new csv file
# (save_file is a helper that is not defined in the cells shown here)
datapath = '../data'
save_file(traffic_crashes_data3, 'traffic_crashes_data_cleaned.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "../data\traffic_crashes_data_cleaned.csv"
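`save_file` is not defined in the cells shown, so it is presumably a helper imported elsewhere. The function below is a hypothetical, non-interactive stand-in (the name `save_csv` and its `overwrite` parameter are assumptions); it shows the same idea of guarding against accidental overwrites.

```python
import os
import tempfile

import pandas as pd

def save_csv(df, filename, datapath, overwrite=False):
    """Write df to datapath/filename; refuse to overwrite unless asked."""
    os.makedirs(datapath, exist_ok=True)
    path = os.path.join(datapath, filename)
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    df.to_csv(path, index=False)
    return path

# Demonstrate on a temporary directory with made-up data.
toy = pd.DataFrame({"x": [1, 2]})
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(toy, "toy.csv", tmp)
    saved = pd.read_csv(out)

print(saved.shape)
```

Using `os.path.join` also avoids mixed separators such as the `../data\...` path in the output above.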
