## Counting crashes

This notebook counts crashes per intersection (node) and per segment. It is organized into three chapters: <br>
- 1. Counting crashes
- 2. Counting crashes per month
- 3. Counting crashes per hour of the day <br>

From this notebook, we export GeoJSON files that contain the number of crashes for each intersection and segment. The map in the dashboard (https://workzone-collision-analysis.github.io/capstone/dashboard/) was made with these GeoJSON files. <br>
We also export CSV files with the number of crashes per month and per hour of the day for each segment and intersection. The charts in the dashboard were drawn from these datasets.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import geopandas as gpd

## 1. Counting Crashes

### 1-1. Short segment

In [2]:
# make sure that you run the 'shst_02_extract_short_segments' notebook to get the 'shst_short_segment_centroid.shp'
# import a shapefile of the short segment centroid
gdf_short_segment = gpd.read_file('../data/sharedstreets_geometry/short_segment/shst_short_segment_centroid.shp')

In [3]:
# drop unnecessary columns
gdf_short_segment = gdf_short_segment[['id','geometry']]

In [4]:
# make sure that you run the 'crash_02_define_Intersection_crash' notebook to get the 'crash_short_segment.shp'
# import crash dataset
gdf_crash_short_segment = gpd.read_file('../data/cleaned_data/crash_seperated/crash_short_segment/crash_short_segment.shp')

In [5]:
# drop unnecessary columns and rename
gdf_crash_short_segment = gdf_crash_short_segment.rename(columns={'collision_':'collision_id'})[['collision_id',
                                                                                                'nearest_id',
                                                                                                'crash_date',
                                                                                                'crash_time']]

In [6]:
gdf_crash_short_segment.head()

Unnamed: 0,collision_id,nearest_id,crash_date,crash_time
0,3530528,e95045cba4602b73447ed579683d85ae,2016-10-01,03:18:00
1,3553471,e95045cba4602b73447ed579683d85ae,2016-11-02,05:54:00
2,3562630,e95045cba4602b73447ed579683d85ae,2016-11-17,00:00:00
3,3585815,e95045cba4602b73447ed579683d85ae,2016-12-23,11:45:00
4,3589594,e95045cba4602b73447ed579683d85ae,2016-12-30,00:30:00


In [7]:
# group by 'nearest_id' (the SharedStreets geometry id of the short segment) and count the crashes
gdf_crash_count_short_segment = gdf_crash_short_segment[['nearest_id','collision_id']].groupby('nearest_id', as_index=False).count()

In [8]:
# rename a column
gdf_crash_count_short_segment = gdf_crash_count_short_segment.rename(columns={'collision_id':'count'})

In [9]:
gdf_crash_count_short_segment.head()

Unnamed: 0,nearest_id,count
0,00010fd3ee560483c21bb98e414741c7,52
1,0006da955dbe286a729ac6847ec22e6f,35
2,00520114b0a7f9d36eafa7e42f03196e,27
3,0059632c4bd2f573e9c2beed50983686,19
4,007c733edcc1bb5d03125cf15a69cf0d,1
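
The group-and-count step above can be sketched on a toy table. This is only an illustration with made-up ids (`seg_a`, `seg_b`), not the project data, and it uses `groupby(...).size()` as an alternative that counts and names the column in one step.

```python
import pandas as pd

# hypothetical toy crash table; the ids are made up for illustration
toy_crash = pd.DataFrame({
    'collision_id': [1, 2, 3, 4],
    'nearest_id': ['seg_a', 'seg_a', 'seg_b', 'seg_a'],
})

# count the rows per segment id and name the result column 'count'
toy_count = toy_crash.groupby('nearest_id').size().reset_index(name='count')
print(toy_count)
```

Note that `size()` counts every row, while `count()` as used in the notebook skips rows where `collision_id` is missing; the two agree when there are no missing ids.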


In [10]:
# merge two datasets
gdf_short_segment = gdf_short_segment.merge(gdf_crash_count_short_segment, left_on='id', right_on='nearest_id', how='left')

In [11]:
gdf_short_segment

Unnamed: 0,id,geometry,nearest_id,count
0,22e99b05de3720ee9abcae8b330251ca,POINT (-73.92000 40.64622),22e99b05de3720ee9abcae8b330251ca,2.0
1,aa26db6c30765ab841c89d8a4480ac05,POINT (-73.91664 40.64195),aa26db6c30765ab841c89d8a4480ac05,37.0
2,d15b64f9e9cf8e2b78b48d0f6a052d4d,POINT (-73.91742 40.64151),d15b64f9e9cf8e2b78b48d0f6a052d4d,8.0
3,0325e4c47ae1f644a9c1389acd663273,POINT (-73.91895 40.64058),0325e4c47ae1f644a9c1389acd663273,31.0
4,d52b58d1e43c0e2f5b8cfed941a7b7da,POINT (-73.91818 40.64107),d52b58d1e43c0e2f5b8cfed941a7b7da,6.0
...,...,...,...,...
5335,2095c3f8d4bf944c203c0da5cde064f0,POINT (-74.25145 40.51075),2095c3f8d4bf944c203c0da5cde064f0,3.0
5336,aaab531b8d25cf88a5b53a009dd27c55,POINT (-74.25135 40.50285),aaab531b8d25cf88a5b53a009dd27c55,1.0
5337,2544ebb02493c6aff14386c7a23b0be6,POINT (-74.23166 40.50272),,
5338,2948f6c8665d68544e636ff141111494,POINT (-74.23512 40.50168),,


In [12]:
# drop an unnecessary column
gdf_short_segment = gdf_short_segment.drop('nearest_id', axis=1)

In [13]:
# replace NaN values with 0 (NaN means the short segment had no crashes)
gdf_short_segment = gdf_short_segment.fillna(0)
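
As a hedged variation on the merge-and-fill steps above, the sketch below fills only the `count` column and restores its integer type, which the NaN values introduced by the left merge turn into float. The toy frames and ids are assumptions for illustration.

```python
import pandas as pd

# hypothetical toy inputs; 'seg_c' has no crashes
toy_segments = pd.DataFrame({'id': ['seg_a', 'seg_b', 'seg_c']})
toy_count = pd.DataFrame({'nearest_id': ['seg_a', 'seg_b'], 'count': [3, 1]})

# left merge keeps every segment; segments without crashes get NaN
merged = toy_segments.merge(toy_count, left_on='id', right_on='nearest_id', how='left')
merged = merged.drop('nearest_id', axis=1)

# fill only the 'count' column and cast it back to int
merged['count'] = merged['count'].fillna(0).astype(int)
print(merged)
```

Filling a single column leaves any other column untouched, which can matter if a frame carries attributes where 0 is not a sensible default.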

In [14]:
# check the distribution of crash counts among short segments with at least one crash
gdf_short_segment.loc[gdf_short_segment['count']!=0, ['count']].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

Unnamed: 0,count
0.0,1.0
0.1,1.0
0.2,3.0
0.3,5.0
0.4,8.0
0.5,11.0
0.6,16.0
0.7,22.0
0.8,32.0
0.9,54.0


In [15]:
gdf_short_segment.to_file('../data/cleaned_data/crash_aggregation/short_segment_centroid.geojson', driver='GeoJSON')

### 1-2. Node

This part follows the same process as '1-1. Short segment'. Here we count the crashes per node point.

In [16]:
# make sure that you run the 'shst_02_extract_short_segments' notebook to get the 'shst_node_filtered.shp'
# import a shapefile of the node point
gdf_node = gpd.read_file('../data/sharedstreets_geometry/node_filtered/shst_node_filtered.shp')

Here, we need to filter the node points based on the NYC boundary.

In [17]:
# import nyc boundary shapefile
gdf_nyc = gpd.read_file('../data/borough_boundaries_w_water/geo_export_b872bcc2-4115-4a61-9581-5cdb4e9449e6.shp')

In [18]:
gdf_node.shape

(46448, 2)

In [19]:
# keep only the SharedStreets nodes that fall inside the NYC boundary
gdf_node = gpd.sjoin(gdf_node, gdf_nyc, predicate='intersects').drop(['index_right',
                                                                     'boro_code',
                                                                     'boro_name',
                                                                     'shape_area',
                                                                     'shape_leng'], axis=1)


In [20]:
gdf_node

Unnamed: 0,node_id,geometry
0,374b01a56e64379b8d7198962eaede90,POINT (-73.91694 40.64668)
1,37db438d57f16f92e5ba91f1ad1793bb,POINT (-73.91765 40.64623)
2,5b6e4972c82ad4eb6d24c17b94b33b59,POINT (-73.91621 40.64715)
3,c8dd8ecf9b57214609ecead610eef9cb,POINT (-73.91715 40.64387)
4,a19ad445993732ebd6b49a61801b9547,POINT (-73.91665 40.64342)
...,...,...
45418,016e4f9296c38e065099f993128762c8,POINT (-73.92334 40.81100)
45419,a4f4b6c170995cf6c619a488a915f065,POINT (-73.92658 40.80984)
45420,2b529042ddf4f71035f5f01e0b555662,POINT (-73.92474 40.80899)
45421,ac17794c8738a19d35677ed9e7071da4,POINT (-73.91729 40.80322)


In [21]:
# make sure that you run the 'crash_02_define_Intersection_crash' notebook to get the 'crash_intersection.shp'
# import crash dataset
gdf_crash_node = gpd.read_file('../data/cleaned_data/crash_seperated/crash_intersection/crash_intersection.shp')

In [22]:
# drop unnecessary columns and rename
gdf_crash_node = gdf_crash_node.rename(columns={'collision_':'collision_id',
                                                'nearest_no':'node_id'})[['collision_id',
                                                                          'node_id',
                                                                          'crash_date',
                                                                          'crash_time']] 

In [23]:
gdf_crash_node.head(3)

Unnamed: 0,collision_id,node_id,crash_date,crash_time
0,3530470,3417fa2d8bcda5154d51fc800439527c,2016-10-01,00:30:00
1,3562305,3417fa2d8bcda5154d51fc800439527c,2016-11-16,17:03:00
2,3577626,3417fa2d8bcda5154d51fc800439527c,2016-12-10,14:00:00


In [24]:
# group by 'node_id' and count the crashes
gdf_crash_count_node = gdf_crash_node[['node_id','collision_id']].groupby('node_id', as_index=False).count()

In [25]:
# rename a column
gdf_crash_count_node = gdf_crash_count_node.rename(columns={'collision_id':'count'})

In [26]:
gdf_crash_count_node.head()

Unnamed: 0,node_id,count
0,0005ae85e017c72c69cbdcd38f986f04,19
1,0005f85368f314bfb0c82b84d0208b9f,1
2,0008525ffca74a2e2af21a2acc91458e,10
3,000c4a68b5f33e69f5de1ed89ad88dfc,7
4,000f29cb12236672cf87a364528139a2,37


In [27]:
# merge two datasets
gdf_node = gdf_node.merge(gdf_crash_count_node, on= 'node_id' ,how='left')

In [28]:
# replace NaN values with 0 (NaN means the node had no crashes)
gdf_node = gdf_node.fillna(0)

In [29]:
# check the distribution of crash counts among nodes with at least one crash
gdf_node.loc[gdf_node['count']!=0, ['count']].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

Unnamed: 0,count
0.0,1.0
0.1,1.0
0.2,2.0
0.3,3.0
0.4,4.0
0.5,6.0
0.6,8.0
0.7,11.0
0.8,15.0
0.9,24.0


In [30]:
gdf_node.to_file('../data/cleaned_data/crash_aggregation/node.geojson', driver='GeoJSON')

### 1-3. Segment

This part follows the same process as the short segment and node parts above.

In [31]:
# make sure that you run the 'shst_02_extract_short_segments' notebook to get the 'shst_segment_filtered.shp'
# import a shapefile of the segments
gdf_segment = gpd.read_file('../data/sharedstreets_geometry/segment_filtered/shst_segment_filtered.shp')

In [32]:
# drop unnecessary columns
gdf_segment = gdf_segment[['id','geometry']]

In [33]:
# make sure that you run the 'crash_02_define_Intersection_crash' notebook to get the 'crash_segment.shp'
# import crash dataset
gdf_crash_segment = gpd.read_file('../data/cleaned_data/crash_seperated/crash_segment/crash_segment.shp')

In [34]:
# drop unnecessary columns and rename
gdf_crash_segment = gdf_crash_segment.rename(columns={'collision_':'collision_id',
                                                      'geometry_i':'geometry_id'})[['collision_id',
                                                                                    'geometry_id',
                                                                                    'crash_date',
                                                                                    'crash_time']] 

In [35]:
gdf_crash_segment.head()

Unnamed: 0,collision_id,geometry_id,crash_date,crash_time
0,3531327,ba4520777941a56b87f97a1d35dc2e20,2016-10-01,20:20:00
1,3530538,da0bde3c3c147e230387851d1679e6bc,2016-10-01,01:40:00
2,3531662,3cf56bacd5522948619990790426e93a,2016-10-01,21:30:00
3,3533597,4f2d7b279a2a4afe38fffab8a5e018c9,2016-10-01,06:30:00
4,3531918,04a23f1a32ca88089f4ca72fd8d259b7,2016-10-01,10:00:00


In [36]:
# group by 'geometry_id' and count the crashes
gdf_crash_count_segment = gdf_crash_segment[['geometry_id','collision_id']].groupby('geometry_id', as_index=False).count()

In [37]:
#  rename
gdf_crash_count_segment = gdf_crash_count_segment.rename(columns={'collision_id':'count'})

In [38]:
gdf_crash_count_segment.head()

Unnamed: 0,geometry_id,count
0,000182e6b337ab6b7c6053a7499de445,6
1,0002d35fe99ef772e991f4f78b338f0f,7
2,0004c29ad411e1df57c0bd30aebf751e,2
3,00065f29c5928a427d1da7074b3af66b,7
4,0006fe158c203a51cda960f5c4c766ee,6


In [39]:
# merge two datasets
gdf_segment = gdf_segment.merge(gdf_crash_count_segment, left_on = 'id', right_on = 'geometry_id', how='left')

In [40]:
# drop unnecessary columns
gdf_segment = gdf_segment.drop('geometry_id', axis=1)

In [41]:
# replace NaN values with 0 (NaN means the segment had no crashes)
gdf_segment = gdf_segment.fillna(0)

In [42]:
gdf_segment.shape

(87455, 3)

In [43]:
# check distribution
gdf_segment['count'].loc[gdf_segment['count']!=0].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

0.0     1.0
0.1     1.0
0.2     1.0
0.3     1.0
0.4     2.0
0.5     2.0
0.6     3.0
0.7     4.0
0.8     6.0
0.9    11.0
Name: count, dtype: float64

In [44]:
gdf_segment.to_file('../data/cleaned_data/crash_aggregation/segment.geojson', driver='GeoJSON')

# 2. Calculate monthly crashes

## 2.1 Short segment

In [45]:
# convert crash_date to a datetime64 column
gdf_crash_short_segment['crash_date'] = pd.to_datetime(gdf_crash_short_segment['crash_date'])

In [46]:
# extract month
gdf_crash_short_segment['crash_month'] = gdf_crash_short_segment['crash_date'].dt.to_period('M').dt.to_timestamp()

In [47]:
gdf_crash_short_segment.head()

Unnamed: 0,collision_id,nearest_id,crash_date,crash_time,crash_month
0,3530528,e95045cba4602b73447ed579683d85ae,2016-10-01,03:18:00,2016-10-01
1,3553471,e95045cba4602b73447ed579683d85ae,2016-11-02,05:54:00,2016-11-01
2,3562630,e95045cba4602b73447ed579683d85ae,2016-11-17,00:00:00,2016-11-01
3,3585815,e95045cba4602b73447ed579683d85ae,2016-12-23,11:45:00,2016-12-01
4,3589594,e95045cba4602b73447ed579683d85ae,2016-12-30,00:30:00,2016-12-01
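
The month extraction above maps every date to the first day of its month. A minimal sketch with three made-up dates (an assumption for illustration) shows the effect:

```python
import pandas as pd

# hypothetical toy dates
toy_dates = pd.Series(pd.to_datetime(['2016-10-01', '2016-11-17', '2016-12-30']))

# convert to a monthly period, then back to a timestamp at the start of that month
toy_month = toy_dates.dt.to_period('M').dt.to_timestamp()
print(toy_month.dt.strftime('%Y-%m-%d').tolist())
```

Because the result is a month-start timestamp, it can be joined directly against a `pd.date_range(..., freq='MS')` index such as the one built later in this chapter.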


In [48]:
# count the crashes per short segment and month
df_crash_short_segment_monthly_count =  gdf_crash_short_segment[['nearest_id',
                                                                  'crash_month',
                                                                  'collision_id']].groupby(['nearest_id',
                                                                                           'crash_month'],
                                                                                            as_index=False).count()

# rename
df_crash_short_segment_monthly_count = df_crash_short_segment_monthly_count.rename(columns={'collision_id':'count'})

In [49]:
# make a list of month and segment id
monthly_range = pd.date_range(start='2016-10-01', end='2019-10-01', freq='MS')
list_short_segment_id = gdf_short_segment['id'].unique().tolist()

In [50]:
# create the list of empty dataframe per segment
list_empty_dataframe = []
for i in list_short_segment_id:
    temp_ = pd.DataFrame(monthly_range).rename(columns={0:'month'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [51]:
# concatenate the empty dataframes
df_monthly_crash = pd.concat(list_empty_dataframe, axis=0)
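
The loop and `pd.concat` above build one row per (id, month) pair. As a sketch of an alternative, not the approach used in this notebook, `pd.MultiIndex.from_product` can produce the same grid in one call. The ids and the shortened date range below are made up for illustration.

```python
import pandas as pd

# hypothetical toy inputs
toy_ids = ['seg_a', 'seg_b']
toy_months = pd.date_range(start='2016-10-01', end='2016-12-01', freq='MS')

# every (id, month) combination; ids vary slowest, so each id's months stay in time order
scaffold = pd.MultiIndex.from_product([toy_ids, toy_months],
                                      names=['id', 'month']).to_frame(index=False)

# match the column order used in the notebook
scaffold = scaffold[['month', 'id']]
print(scaffold)
```

This avoids creating one small dataframe per id, which may be noticeably faster when there are tens of thousands of segments.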

In [52]:
# merge the dataframes to get the number of crashes per short segment and month;
# the count is NaN where a segment had no crashes in a given month
df_monthly_crash = df_monthly_crash.merge(df_crash_short_segment_monthly_count,
                                         left_on=['id','month'],
                                         right_on=['nearest_id','crash_month'],
                                         how='left')

In [53]:
# drop unnecessary columns, fill NaN values as 0
df_monthly_crash = df_monthly_crash.drop(['nearest_id','crash_month'], axis=1)
df_monthly_crash = df_monthly_crash.fillna(0)

In [54]:
# change datatype to int
df_monthly_crash['count'] = df_monthly_crash['count'].astype(int)

In [55]:
# export the array of months, which gives the month for each position in the monthly count lists
pd.Series(monthly_range).to_csv('../data/cleaned_data/crash_aggregation/month_array.csv', index=False)

To minimize the size of the CSV file, we save the monthly crash counts of each short segment as a single list, ordered by month.

In [56]:
df_monthly_crash_count_list = df_monthly_crash[['id','count']].groupby('id')['count'].apply(list)

In [57]:
# check that the rows behind the list are in chronological order
df_monthly_crash.loc[df_monthly_crash['id']=='00010fd3ee560483c21bb98e414741c7']

Unnamed: 0,month,id,count
31783,2016-10-01,00010fd3ee560483c21bb98e414741c7,2
31784,2016-11-01,00010fd3ee560483c21bb98e414741c7,1
31785,2016-12-01,00010fd3ee560483c21bb98e414741c7,4
31786,2017-01-01,00010fd3ee560483c21bb98e414741c7,1
31787,2017-02-01,00010fd3ee560483c21bb98e414741c7,1
31788,2017-03-01,00010fd3ee560483c21bb98e414741c7,3
31789,2017-04-01,00010fd3ee560483c21bb98e414741c7,0
31790,2017-05-01,00010fd3ee560483c21bb98e414741c7,2
31791,2017-06-01,00010fd3ee560483c21bb98e414741c7,1
31792,2017-07-01,00010fd3ee560483c21bb98e414741c7,2


In [58]:
# export the dataset
df_monthly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_short_segment_monthly.csv')
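
When a CSV like the one exported above is read back with pandas, each list arrives as a string such as `"[2, 1, 4]"`. The sketch below, which uses an in-memory buffer and made-up ids instead of the real file, shows one way to turn those strings back into lists with `ast.literal_eval`.

```python
import ast
import io
import pandas as pd

# hypothetical toy series of per-id count lists, shaped like the exported one
toy_lists = pd.Series({'seg_a': [2, 1, 4], 'seg_b': [0, 0, 1]}, name='count')
toy_lists.index.name = 'id'

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_lists.to_csv(buffer)
buffer.seek(0)

# the lists come back as strings, so parse them into Python lists
restored = pd.read_csv(buffer, index_col='id')
restored['count'] = restored['count'].apply(ast.literal_eval)
print(restored.loc['seg_a', 'count'])
```

Position k of each restored list corresponds to the k-th month in `month_array.csv`.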

## 2.2 Node

Same process as the short segment parts.

In [59]:
# convert crash_date to a datetime64 column
gdf_crash_node['crash_date'] = pd.to_datetime(gdf_crash_node['crash_date'])

In [60]:
# extract month
gdf_crash_node['crash_month'] = gdf_crash_node['crash_date'].dt.to_period('M').dt.to_timestamp()

In [61]:
gdf_crash_node.head()

Unnamed: 0,collision_id,node_id,crash_date,crash_time,crash_month
0,3530470,3417fa2d8bcda5154d51fc800439527c,2016-10-01,00:30:00,2016-10-01
1,3562305,3417fa2d8bcda5154d51fc800439527c,2016-11-16,17:03:00,2016-11-01
2,3577626,3417fa2d8bcda5154d51fc800439527c,2016-12-10,14:00:00,2016-12-01
3,3582823,3417fa2d8bcda5154d51fc800439527c,2016-12-19,05:45:00,2016-12-01
4,3594234,3417fa2d8bcda5154d51fc800439527c,2017-01-07,09:50:00,2017-01-01


In [62]:
# count the crashes per node and month
df_crash_node_monthly_count =  gdf_crash_node[['node_id',
                                               'crash_month',
                                               'collision_id']].groupby(['node_id',
                                                                         'crash_month'],
                                                                         as_index=False).count()

df_crash_node_monthly_count = df_crash_node_monthly_count.rename(columns={'collision_id':'count'})

In [63]:
# make a list of month and node id
monthly_range = pd.date_range(start='2016-10-01', end='2019-10-01', freq='MS')
list_node_id = gdf_node['node_id'].unique().tolist()

In [64]:
# create the list of empty dataframe per node
list_empty_dataframe = []
for i in list_node_id:
    temp_ = pd.DataFrame(monthly_range).rename(columns={0:'month'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [65]:
# concatenate the empty dataframes
df_monthly_crash = pd.concat(list_empty_dataframe, axis=0)

In [66]:
# merge the dataframes to get the number of crashes per node and month;
# the count is NaN where a node had no crashes in a given month
df_monthly_crash = df_monthly_crash.merge(df_crash_node_monthly_count,
                                         left_on=['id','month'],
                                         right_on=['node_id','crash_month'],
                                         how='left')

In [67]:
# drop unnecessary columns, fill NaN values as 0
df_monthly_crash = df_monthly_crash.drop(['node_id','crash_month'], axis=1)
df_monthly_crash = df_monthly_crash.fillna(0)

In [68]:
# change datatype to int
df_monthly_crash['count'] = df_monthly_crash['count'].astype(int)

In [69]:
# To minimize the size of the csv file, we will save the number of crashes as a list
df_monthly_crash_count_list = df_monthly_crash[['id','count']].groupby('id')['count'].apply(list)

In [70]:
# export the dataset
df_monthly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_node_monthly.csv')

## 2.3 Segment

Same process as the short segment and node parts.

In [71]:
# convert crash_date to a datetime64 column
gdf_crash_segment['crash_date'] = pd.to_datetime(gdf_crash_segment['crash_date'])

In [72]:
# extract month
gdf_crash_segment['crash_month'] = gdf_crash_segment['crash_date'].dt.to_period('M').dt.to_timestamp()

In [73]:
gdf_crash_segment

Unnamed: 0,collision_id,geometry_id,crash_date,crash_time,crash_month
0,3531327,ba4520777941a56b87f97a1d35dc2e20,2016-10-01,20:20:00,2016-10-01
1,3530538,da0bde3c3c147e230387851d1679e6bc,2016-10-01,01:40:00,2016-10-01
2,3531662,3cf56bacd5522948619990790426e93a,2016-10-01,21:30:00,2016-10-01
3,3533597,4f2d7b279a2a4afe38fffab8a5e018c9,2016-10-01,06:30:00,2016-10-01
4,3531918,04a23f1a32ca88089f4ca72fd8d259b7,2016-10-01,10:00:00,2016-10-01
...,...,...,...,...,...
250029,4232681,dd9128bb0ba0537752c5f8b63a04a220,2019-10-31,05:00:00,2019-10-01
250030,4236430,a3396534bd98ea8552b8b982cd17a510,2019-10-31,13:00:00,2019-10-01
250031,4233134,f4ce05d8cd482c68af3eb4bfb90fc106,2019-10-31,14:20:00,2019-10-01
250032,4234695,f2a11635cbcf17ecb1ebe7c938f09bdc,2019-10-31,14:20:00,2019-10-01


In [74]:
# count the crashes per segment and month
df_crash_segment_monthly_count =  gdf_crash_segment[['geometry_id',
                                                     'crash_month',
                                                     'collision_id']].groupby(['geometry_id',
                                                                               'crash_month'],
                                                                                as_index=False).count()
  
df_crash_segment_monthly_count = df_crash_segment_monthly_count.rename(columns={'collision_id':'count'})

In [75]:
# make a list of month and segment id
monthly_range = pd.date_range(start='2016-10-01', end='2019-10-01', freq='MS')
list_segment_id = gdf_segment['id'].unique().tolist()

In [76]:
# create the list of empty dataframe per segment
list_empty_dataframe = []
for i in list_segment_id:
    temp_ = pd.DataFrame(monthly_range).rename(columns={0:'month'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [77]:
# concatenate the empty dataframes
df_monthly_crash = pd.concat(list_empty_dataframe, axis=0)

In [78]:
# merge the dataframes to get the number of crashes per segment and month;
# the count is NaN where a segment had no crashes in a given month
df_monthly_crash = df_monthly_crash.merge(df_crash_segment_monthly_count,
                                         left_on=['id','month'],
                                         right_on=['geometry_id','crash_month'],
                                         how='left')

In [79]:
# drop unnecessary columns, fill NaN values as 0
df_monthly_crash = df_monthly_crash.drop(['geometry_id','crash_month'], axis=1)
df_monthly_crash = df_monthly_crash.fillna(0)

In [80]:
# change datatype to int
df_monthly_crash['count'] = df_monthly_crash['count'].astype(int)

In [81]:
# To minimize the size of the csv file, we will save the number of crashes as a list
df_monthly_crash_count_list = df_monthly_crash[['id','count']].groupby('id')['count'].apply(list)

In [82]:
# preview the per-segment lists of monthly counts
df_monthly_crash_count_list.head()

id
0000b4f516894dfb309654e1a12bc7b1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
000115ffe0b626b4c1310827d7b28822    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
000182e6b337ab6b7c6053a7499de445    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
0001f6598a4739e7244d278eb317cb39    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
0002985acfe74fd5ff37d246e5509fe4    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: count, dtype: object

In [83]:
# check that the rows behind the list are in chronological order
df_monthly_crash.loc[df_monthly_crash['id']=='000182e6b337ab6b7c6053a7499de445']

Unnamed: 0,month,id,count
674769,2016-10-01,000182e6b337ab6b7c6053a7499de445,0
674770,2016-11-01,000182e6b337ab6b7c6053a7499de445,0
674771,2016-12-01,000182e6b337ab6b7c6053a7499de445,1
674772,2017-01-01,000182e6b337ab6b7c6053a7499de445,0
674773,2017-02-01,000182e6b337ab6b7c6053a7499de445,0
674774,2017-03-01,000182e6b337ab6b7c6053a7499de445,0
674775,2017-04-01,000182e6b337ab6b7c6053a7499de445,0
674776,2017-05-01,000182e6b337ab6b7c6053a7499de445,0
674777,2017-06-01,000182e6b337ab6b7c6053a7499de445,1
674778,2017-07-01,000182e6b337ab6b7c6053a7499de445,1


In [84]:
# export the dataset
df_monthly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_segment_monthly.csv')

# 3. Calculate hourly crash counts

## 3.1 Short segment

This follows the same process as the monthly crash counts chapter, but the unit of time is the hour of the day.

In [85]:
# convert crash_time to a datetime64 column
gdf_crash_short_segment['crash_time'] = pd.to_datetime(gdf_crash_short_segment['crash_time'])

In [86]:
# extract hour
gdf_crash_short_segment['hour'] = gdf_crash_short_segment['crash_time'].dt.hour
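
As a small sketch of the hour extraction above, the example below parses made-up `HH:MM:SS` strings with an explicit `format`, which is an assumption about how `crash_time` is stored based on the table previews earlier in this notebook.

```python
import pandas as pd

# hypothetical toy times in the 'HH:MM:SS' layout shown in the previews
toy_time = pd.Series(['03:18:00', '11:45:00', '00:30:00'])

# parse with an explicit format, then keep only the hour of the day (0-23)
toy_hour = pd.to_datetime(toy_time, format='%H:%M:%S').dt.hour
print(toy_hour.tolist())
```

Passing `format` avoids relying on pandas to infer the layout; the date part that pandas attaches to the parsed values is irrelevant here because only `.dt.hour` is used.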

In [87]:
# count the crashes per segment and hour
df_crash_short_segment_hourly_count =  gdf_crash_short_segment[['nearest_id',
                                                                'hour',
                                                                'collision_id']].groupby(['nearest_id',
                                                                                          'hour'],
                                                                                            as_index=False).count()

df_crash_short_segment_hourly_count = df_crash_short_segment_hourly_count.rename(columns={'collision_id':'count'})

In [88]:
# make a list of hours and short segment ids (here only the ids that appear in the monthly crash counts)
hourly_range = list(range(0,24))
list_short_segment_id = df_crash_short_segment_monthly_count['nearest_id'].unique().tolist()

In [89]:
# create the list of empty dataframe per segment
list_empty_dataframe = []
for i in list_short_segment_id:
    temp_ = pd.DataFrame(hourly_range).rename(columns={0:'hour'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [90]:
# concatenate the empty dataframes
df_hourly_crash = pd.concat(list_empty_dataframe, axis=0)

In [91]:
# merge the dataframes to get the number of crashes per short segment and hour;
# the count is NaN where a segment had no crashes in a given hour
df_hourly_crash = df_hourly_crash.merge(df_crash_short_segment_hourly_count,
                                         left_on=['id','hour'],
                                         right_on=['nearest_id','hour'],
                                         how='left')

In [92]:
# drop unnecessary columns, fill NaN values as 0
df_hourly_crash = df_hourly_crash.drop('nearest_id', axis=1)
df_hourly_crash = df_hourly_crash.fillna(0)

In [93]:
# change datatype to int
df_hourly_crash['count'] = df_hourly_crash['count'].astype(int)

In [94]:
# To minimize the size of the csv file, we will save the number of crashes as a list
df_hourly_crash_count_list = df_hourly_crash[['id','count']].groupby('id')['count'].apply(list)

In [95]:
# check that the rows behind the list are ordered by hour
df_hourly_crash.loc[df_hourly_crash['id']=='00010fd3ee560483c21bb98e414741c7'] 

Unnamed: 0,hour,id,count
0,0,00010fd3ee560483c21bb98e414741c7,3
1,1,00010fd3ee560483c21bb98e414741c7,2
2,2,00010fd3ee560483c21bb98e414741c7,0
3,3,00010fd3ee560483c21bb98e414741c7,1
4,4,00010fd3ee560483c21bb98e414741c7,0
5,5,00010fd3ee560483c21bb98e414741c7,0
6,6,00010fd3ee560483c21bb98e414741c7,1
7,7,00010fd3ee560483c21bb98e414741c7,6
8,8,00010fd3ee560483c21bb98e414741c7,6
9,9,00010fd3ee560483c21bb98e414741c7,4


In [96]:
# check that the first list matches the hourly counts shown above
df_hourly_crash_count_list.iloc[0]

[3, 2, 0, 1, 0, 0, 1, 6, 6, 4, 3, 2, 0, 4, 4, 5, 2, 1, 3, 0, 2, 0, 2, 1]
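
For comparison, the same kind of 24-value list per id can be sketched with `pd.crosstab` and `reindex`. This is an alternative technique, not the one used in this notebook, and the toy table is made up for illustration.

```python
import pandas as pd

# hypothetical toy crash table with one row per crash
toy_crash = pd.DataFrame({
    'nearest_id': ['seg_a', 'seg_a', 'seg_b', 'seg_a'],
    'hour': [7, 7, 23, 0],
})

# rows are ids, columns are the hours that occur in the data
hourly_table = pd.crosstab(toy_crash['nearest_id'], toy_crash['hour'])

# add the missing hours as zero-filled columns so every row has 24 values
hourly_table = hourly_table.reindex(columns=range(24), fill_value=0)

# collapse each row into a list ordered from hour 0 to hour 23
hourly_lists = pd.Series(hourly_table.values.tolist(), index=hourly_table.index)
print(hourly_lists['seg_a'])
```

Like the short segment id list used above, this only yields ids that have at least one crash; ids without crashes would need to be added separately.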

In [97]:
# export the dataset
df_hourly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_short_segment_hourly.csv')

## 3.2 Node

In [98]:
# convert crash_time to a datetime64 column
gdf_crash_node['crash_time'] = pd.to_datetime(gdf_crash_node['crash_time'])

In [99]:
# extract hour
gdf_crash_node['hour'] = gdf_crash_node['crash_time'].dt.hour

In [100]:
# count the crashes per node and hour
df_crash_node_hourly_count =  gdf_crash_node[['node_id',
                                              'hour',
                                              'collision_id']].groupby(['node_id',
                                                                        'hour'],
                                                                        as_index=False).count()

df_crash_node_hourly_count = df_crash_node_hourly_count.rename(columns={'collision_id':'count'})

In [101]:
# make a list of hour and node id
hourly_range = list(range(0,24))
list_node_id = gdf_node['node_id'].unique().tolist()

In [None]:
# create the list of empty dataframes, one per node
list_empty_dataframe = []
for i in list_node_id:
    temp_ = pd.DataFrame(hourly_range).rename(columns={0:'hour'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [None]:
# concatenate the empty dataframes
df_hourly_crash = pd.concat(list_empty_dataframe, axis=0)

In [None]:
# merge the dataframes to get the number of crashes per node and hour;
# the count is NaN where a node had no crashes in a given hour
df_hourly_crash = df_hourly_crash.merge(df_crash_node_hourly_count,
                                         left_on=['id','hour'],
                                         right_on=['node_id','hour'],
                                         how='left')

In [None]:
# drop unnecessary columns, fill NaN values as 0
df_hourly_crash = df_hourly_crash.drop('node_id', axis=1)
df_hourly_crash = df_hourly_crash.fillna(0)

In [None]:
# change datatype to int
df_hourly_crash['count'] = df_hourly_crash['count'].astype(int)

In [None]:
# To minimize the size of the csv file, we will save the number of crashes as a list
df_hourly_crash_count_list = df_hourly_crash[['id','count']].groupby('id')['count'].apply(list)

In [None]:
# export the dataset
df_hourly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_node_hourly.csv')

## 3.3 Segment

In [None]:
# convert crash_time to a datetime64 column
gdf_crash_segment['crash_time'] = pd.to_datetime(gdf_crash_segment['crash_time'])

In [None]:
# extract hour
gdf_crash_segment['hour'] = gdf_crash_segment['crash_time'].dt.hour

In [None]:
# count the crashes per segment and hour
df_crash_segment_hourly_count =  gdf_crash_segment[['geometry_id',
                                                    'hour',
                                                    'collision_id']].groupby(['geometry_id',
                                                                              'hour'],
                                                                              as_index=False).count()

df_crash_segment_hourly_count = df_crash_segment_hourly_count.rename(columns={'collision_id':'count'})

In [None]:
# make a list of hour and segment id
hourly_range = list(range(0,24))
list_segment_id = gdf_segment['id'].unique().tolist()

In [None]:
# create the list of empty dataframe per segment
list_empty_dataframe = []
for i in list_segment_id:
    temp_ = pd.DataFrame(hourly_range).rename(columns={0:'hour'})
    temp_['id'] = i
    list_empty_dataframe.append(temp_)

In [None]:
# concatenate the empty dataframes
df_hourly_crash = pd.concat(list_empty_dataframe, axis=0)

In [None]:
# merge the dataframes to get the number of crashes per segment and hour;
# the count is NaN where a segment had no crashes in a given hour
df_hourly_crash = df_hourly_crash.merge(df_crash_segment_hourly_count,
                                        left_on=['id','hour'],
                                        right_on=['geometry_id','hour'],
                                        how='left')

In [None]:
# drop unnecessary columns, fill NaN values as 0
df_hourly_crash = df_hourly_crash.drop('geometry_id', axis=1)
df_hourly_crash = df_hourly_crash.fillna(0)

In [None]:
# change datatype to int
df_hourly_crash['count'] = df_hourly_crash['count'].astype(int)

In [None]:
# To minimize the size of the csv file, we will save the number of crashes as a list
df_hourly_crash_count_list = df_hourly_crash[['id','count']].groupby('id')['count'].apply(list)

In [None]:
# export the dataset
df_hourly_crash_count_list.to_csv('../data/cleaned_data/crash_aggregation/crash_segment_hourly.csv')