# Building a SQL database
This notebook builds the necessary db files for the project using SQLite3.

Most data is taken from the Chicago Data Portal (https://data.cityofchicago.org/) using the sodapy library, a Python client for the Socrata API.
The API endpoints for the data are:
- Red Light Violations: https://data.cityofchicago.org/resource/spqx-js37.json
- Congestion by Region 2018-Present: https://data.cityofchicago.org/resource/kf7e-cur8.json
- Congestion by Region 2013-2018: https://data.cityofchicago.org/resource/emtn-qqdi.json
- Traffic Crashes: https://data.cityofchicago.org/resource/85ca-t3if.json

Weather data is taken from https://openweathermap.org/weather-data and is saved as a CSV file in the data folder.

Tables to build:
- daily_violations (one entry for each camera each day with total violations)
- intersection_locations (one entry for each intersection with lat/long)
- intersection_cams (one entry for each intersection with camera_ids)
- signal_crashes (one entry for each intersection crash with many columns)
- cam_locations (one entry for each cam, with lat/long)
- cam_startend (one entry for each cam with start end dates for min/max dates active)
- hourly_congestion (one entry per hour with bus speed averages for each region)
- hourly_weather (one entry per hour with many weather cols)
- region_data (one entry per region with locations and descriptions to place intersections)


## Required Imports

In [93]:
import pandas as pd
from sodapy import Socrata
#import matplotlib.pyplot as plt
from datetime import datetime
from modules.myfuncs import *
import warnings
import numpy as np
from geopy.geocoders import Nominatim
# import dask
# import dask.dataframe as dd
import gc
from scipy.stats import mode

warnings.filterwarnings('ignore')

## Create/connect to db and build the TABLEs

In [94]:
# Create a db file or open connection
conn = create_connection('database/rlc2.db')  # function I created in myfuncs file
c = conn.cursor()
#conn.close()

sqlite3 version: 2.6.0
connected to database/rlc2.db
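
For readers without the myfuncs module, here is a minimal sketch of what a helper like `create_connection` might look like. The name `create_connection_sketch` and its printed messages are assumptions for illustration, not the actual code in myfuncs. It uses an in-memory database so nothing is written to disk.

```python
import sqlite3


def create_connection_sketch(db_path):
    """Open (or create) a SQLite database and return the connection.

    Hypothetical stand-in for the create_connection helper in myfuncs.
    """
    conn = sqlite3.connect(db_path)
    # sqlite_version reports the version of the underlying SQLite library
    print("SQLite library version:", sqlite3.sqlite_version)
    print("connected to", db_path)
    return conn


# ":memory:" creates a temporary database that disappears on close
demo_conn = create_connection_sketch(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("SELECT 1")
demo_result = demo_cur.fetchone()
demo_conn.close()
```

Passing a file path such as 'database/rlc2.db' instead of ":memory:" would create the file if it does not exist yet.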


## Set up the Socrata client
The Chicago Data Portal, which holds most of the data used in this project, runs on Socrata software, which can be accessed through Python's sodapy library.
Here we create a client, which we will use to query the data at the portal.


In [3]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:

url = "data.cityofchicago.org"
client = Socrata(url, None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofchicago.org",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")



# TABLE builds

For every TABLE
- Use a Socrata client query to get all relevant data
- Preprocess data as needed
- Create Table

Our data
- rlc_cam: up to 1M red light camera records from 2015 to 2020
- crash_data: up to 1M crash records from 2015 to 2020
- traffic_data: up to 10M records from 2015 to 2020

Weather data is taken from csv in data folder

## 1) Build intersection_chars TABLE from int_df

I created this data myself.  It is a dictionary stored in this repository in the file 'int_chars.py'.

To create this file, I went through all 180+ intersections with red light cameras.  I cross-referenced them with the map at https://data.cityofchicago.org/Transportation/Average-Daily-Traffic-Counts-Map/pf56-35rv and with Google Maps to compile the following data.

- roads (list): road segments as identified in the average-daily-traffic-counts db in the link above.  Used to determine volume of traffic.
- protected_turn (int): How many of the left turns are protected (left turn arrow).
- total_lanes (int): Count of total lanes.  If there is one lane for each direction of N/E/S/W bound traffic, that counts as 4.  Ranges from 3 to 14 lanes.
- medians (int): Count of physical median barriers that extend up to intersection.
- exit (int): 0 if no exit.  1 if there is an exit on/off ramp within 100m of center of intersection.  Traffic flow is affected by proximity to exit.
- split (int): 1 if it is a divided boulevard (common in Chicago) where divided lanes are split by traffic signals in median. Look at examples on Google Maps.
- way (int): number of directions of traffic flow. A 4 way intersection might be NESW.
- underpass (int): number of ways that have an underpass extending up to the intersection.  These are notoriously bad intersections in Chicago.
- no_left (int): number of no left turn signs.  Usually with smaller streets onto larger roads or high volume intersections.
- angled (int): 1 if angle between two 2way roads is greater than 30 degrees (used 1/2/sqrt(3) rule to measure).
- triangle (int): 1 if three 2way roads meet at the intersection or form a triangle where all 3 roads are <50m
- one_way (int): number of 1 way directions.
- turn_lanes (int): how many directions have physical and identified turn lanes for left hand turns.
- lat (float): latitude of center of intersection.
- long (float): longitude of center of intersection.
- rlc (int): 1 if a red light camera is present
- intersection (str): name of intersection as defined in signal_crashes table in db
- daily_traffic (int): volume of daily traffic through intersection.  Sum of incoming roads from roads list.

### Import the dictionary and convert to DataFrame which will be written as Table in my db

In [4]:
from modules.int_chars import *
import pandas as pd

int_chars.keys()
int_df = pd.DataFrame.from_dict(int_chars, orient='index')
int_df['intersection'] = int_chars.keys()
int_df.isna().sum()

roads             0
protected_turn    0
total_lanes       0
medians           0
exit              0
split             0
way               0
underpass         0
no_left           0
angled            0
triangle          0
one_way           0
turn_lanes        0
lat               0
long              0
rlc               0
intersection      0
dtype: int64

For now, we will only use intersections with rlc of 1.  I may later add intersections without rlc to identify crash characteristics.

In [5]:
int_df = int_df[int_df['rlc']==1]
int_df.columns

Index(['roads', 'protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'lat', 'long', 'rlc', 'intersection'],
      dtype='object')

The columns were read in as non-null objects.  I would like to cast them to numeric types before creating a table.

In [6]:
cols_toint = ['protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'rlc']
cols_tofloat = ['lat', 'long',]

int_df[cols_toint] = int_df[cols_toint].astype(int)
int_df[cols_tofloat] = int_df[cols_tofloat].astype(float)
int_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 183 entries, 111TH AND HALSTED to WESTERN AND TOUHY
Data columns (total 17 columns):
roads             183 non-null object
protected_turn    183 non-null int64
total_lanes       183 non-null int64
medians           183 non-null int64
exit              183 non-null int64
split             183 non-null int64
way               183 non-null int64
underpass         183 non-null int64
no_left           183 non-null int64
angled            183 non-null int64
triangle          183 non-null int64
one_way           183 non-null int64
turn_lanes        183 non-null int64
lat               183 non-null float64
long              183 non-null float64
rlc               183 non-null int64
intersection      183 non-null object
dtypes: float64(2), int64(13), object(2)
memory usage: 25.7+ KB


In [7]:
int_df.head()

Unnamed: 0,roads,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection
111TH AND HALSTED,"[28 West, 11600 South]",2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED
115TH AND HALSTED,"[714 West, 11600 South]",4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED
119TH AND HALSTED,"[446 West, 11600 South]",4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED
31ST AND CALIFORNIA,"[2825 West, 3026 South]",2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA
31ST ST AND MARTIN LUTHER KING DRIVE,"[440 East, 3030 South]",2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE


#### Now bring in my count information using my roads list associated with each intersection

In [8]:
daily_traffic = client.get("pfsx-4n4m", 
                     limit=2000,
                    )

daily_traffic = pd.DataFrame.from_records(daily_traffic) # Convert to pandas DataFrame

In [9]:
daily_traffic.info()
cols_tokeep = ['traffic_volume_count_location_address', 'total_passing_vehicle_volume',]
daily_traffic = daily_traffic[cols_tokeep]

daily_traffic.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1279 entries, 0 to 1278
Data columns (total 15 columns):
id                                             1279 non-null object
traffic_volume_count_location_address          1279 non-null object
street                                         1279 non-null object
date_of_count                                  1279 non-null object
total_passing_vehicle_volume                   1279 non-null object
vehicle_volume_by_each_direction_of_traffic    1279 non-null object
latitude                                       1279 non-null object
longitude                                      1279 non-null object
location                                       1279 non-null object
:@computed_region_rpca_8um6                    1266 non-null object
:@computed_region_vrxf_vc4k                    1266 non-null object
:@computed_region_6mkv_f3dw                    1279 non-null object
:@computed_region_bdys_3d7i                    1265 non-null object
:@compute

Unnamed: 0,traffic_volume_count_location_address,total_passing_vehicle_volume
0,5838 West,7100
1,320 East,8600
2,1730 East,53500
3,125 East,700
4,2924 East,4200


In [10]:
daily_traffic.total_passing_vehicle_volume = daily_traffic.total_passing_vehicle_volume.astype(int)

daily_traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1279 entries, 0 to 1278
Data columns (total 2 columns):
traffic_volume_count_location_address    1279 non-null object
total_passing_vehicle_volume             1279 non-null int64
dtypes: int64(1), object(1)
memory usage: 20.1+ KB


Combine my characteristics with my daily_traffic by looking up traffic volume from daily_traffic df.

In [11]:
int_df.roads.head()

111TH AND HALSTED                        [28 West, 11600 South]
115TH AND HALSTED                       [714 West, 11600 South]
119TH AND HALSTED                       [446 West, 11600 South]
31ST AND CALIFORNIA                     [2825 West, 3026 South]
31ST ST AND MARTIN LUTHER KING DRIVE     [440 East, 3030 South]
Name: roads, dtype: object

In [12]:
def look_up_roads(road_list):
    '''
    Look up the traffic volume of each road segment and return the total
            Parameters:
                road_list (list): road segment list for intersection
            Returns:
                total (int): combined traffic volume of every road in road_list.
    '''
    total = 0  
    for road in road_list:
        count = daily_traffic[daily_traffic['traffic_volume_count_location_address']==road]['total_passing_vehicle_volume'].values[0]
        total += count
    return total

int_df['daily_traffic'] = int_df['roads'].apply(look_up_roads)
int_df.drop(columns=['roads'], inplace=True)
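
The lookup above filters the whole daily_traffic frame once per road and raises an unhelpful IndexError when a road is not found. As a sketch only, the same idea can be expressed with an address-indexed Series built once. The toy frame, `volume_by_address`, and `total_volume` are hypothetical names with made-up values, not part of this project.

```python
import pandas as pd

# Toy stand-in for daily_traffic (values are illustrative only)
toy_traffic = pd.DataFrame({
    "traffic_volume_count_location_address": ["28 West", "11600 South", "714 West"],
    "total_passing_vehicle_volume": [100, 250, 400],
})

# Build the lookup once: address -> volume.
# drop_duplicates keeps the first match, mirroring .values[0] in look_up_roads.
volume_by_address = (
    toy_traffic
    .drop_duplicates(subset="traffic_volume_count_location_address")
    .set_index("traffic_volume_count_location_address")["total_passing_vehicle_volume"]
)


def total_volume(road_list, lookup):
    """Sum the volumes of all roads in road_list; name any road that is missing."""
    missing = [road for road in road_list if road not in lookup.index]
    if missing:
        raise KeyError(f"no traffic count for: {missing}")
    return int(lookup.loc[road_list].sum())


toy_total = total_volume(["28 West", "11600 South"], volume_by_address)
```

With the real data this would be applied the same way as before, for example `int_df['roads'].apply(lambda roads: total_volume(roads, volume_by_address))`.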

In [13]:
int_df.head()

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
111TH AND HALSTED,2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED,43100
115TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED,42500
119TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED,41800
31ST AND CALIFORNIA,2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA,41100
31ST ST AND MARTIN LUTHER KING DRIVE,2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE,36500


In [14]:
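# THIS WAS MOVED TO myfuncs.  Test out with this deleted later.
def make_table(df, table_name, c, conn):
    '''
    Write df to the db as table_name, replacing any existing table
            Parameters:
                df (DataFrame): data to write
                table_name (str): name of the table
                c: sqlite3 cursor object
                conn: sqlite3 connection object
    '''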
    if table_name in sql_fetch_tables(c, conn):  # helper function in myfuncs
        delete_all_entries(c, conn, table_name) # in myfuncs
    
    df.to_sql(table_name, conn, if_exists='replace', index = False)    
    print(sql_fetch_tables(c, conn))
    


In [15]:
make_table(int_df, 'intersection_chars', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',)]
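
To illustrate the table-writing step without the myfuncs helpers, here is a small self-contained sketch against an in-memory database. `list_tables` is a hypothetical stand-in for `sql_fetch_tables`, and the frames hold made-up values. It shows that `if_exists='replace'` drops and recreates the table, so emptying the table first appears to be optional.

```python
import sqlite3

import pandas as pd


def list_tables(conn):
    """Return the names of all tables in the SQLite database (hypothetical helper)."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [name for (name,) in rows]


demo_conn = sqlite3.connect(":memory:")
first = pd.DataFrame({"intersection": ["A AND B"], "daily_traffic": [100]})
second = pd.DataFrame({"intersection": ["C AND D", "E AND F"], "daily_traffic": [200, 300]})

# Write the table, then write it again: the second call replaces the first
first.to_sql("demo_table", demo_conn, if_exists="replace", index=False)
second.to_sql("demo_table", demo_conn, if_exists="replace", index=False)

tables = list_tables(demo_conn)
row_count = demo_conn.execute("SELECT COUNT(*) FROM demo_table").fetchone()[0]
demo_conn.close()
```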


## 2) Build daily_violations TABLE  from rlc_df

This table will hold the red light camera violations data.  It will include daily violations of every red light camera.
This is a large dataset (320 cameras).

### Get red light violation data from Socrata query

In [16]:
# Red light violations
# Takes several minutes to run and holds about 500mb in memory to build

# Up to 10,000,000 results (see limit below), returned as JSON from API / converted to Python list of dictionaries by sodapy
rlc_df = client.get("spqx-js37", #speed cams are at 'hhkd-xvj4' if you want to investigate?
                     #where='violation_date > 01-01-2020',
                     where='violation_date > \'2016-01-01T00:00:00.000\'',
                     limit=10000000,
                    )

rlc_df = pd.DataFrame.from_records(rlc_df) # Convert to pandas DataFrame
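
A single request with a very large limit works here, but the same data could also be pulled in pages using the `limit` and `offset` query parameters that `client.get` accepts. The sketch below is an assumption about how that could be done, not what this notebook ran. `fetch_all_rows` and `FakeClient` are hypothetical names, and the fake client only exists so the example runs offline.

```python
def fetch_all_rows(client, dataset_id, page_size=50000, **query):
    """Collect every row matching the query by requesting one page at a time."""
    rows, offset = [], 0
    while True:
        page = client.get(dataset_id, limit=page_size, offset=offset, **query)
        rows.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            return rows
        offset += page_size


class FakeClient:
    """Offline stand-in for sodapy.Socrata (hypothetical, for illustration only)."""

    def __init__(self, n_rows):
        self.data = [{"row": i} for i in range(n_rows)]

    def get(self, dataset_id, limit, offset, **query):
        return self.data[offset:offset + limit]


demo_rows = fetch_all_rows(FakeClient(7), "spqx-js37", page_size=3)
```

With the real client, passing an `order` argument as well should keep the paging stable between requests.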

### Preprocess Red Light Camera Data

Data Columns of interest:

INTERSECTION -
Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection. Plain Text

CAMERA ID -
A unique ID for each physical camera at an intersection, which may contain more than one camera. Plain Text

ADDRESS -
The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera. Plain Text

VIOLATION DATE -
The date of when the violations occurred. NOTE: The citation may be issued on a different date. Date & Time

VIOLATIONS - 
Number of violations for each camera on a particular day. Number

LATITUDE -
The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84. Number

LONGITUDE -
The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84. Number

#### Investigate rlc_df

In [17]:
rlc_df.info()
rlc_df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480825 entries, 0 to 480824
Data columns (total 10 columns):
intersection      480825 non-null object
camera_id         480709 non-null object
address           480825 non-null object
violation_date    480825 non-null object
violations        480825 non-null object
x_coordinate      455746 non-null object
y_coordinate      455746 non-null object
latitude          455746 non-null object
longitude         455746 non-null object
location          455746 non-null object
dtypes: object(10)
memory usage: 36.7+ MB


intersection          0
camera_id           116
address               0
violation_date        0
violations            0
x_coordinate      25079
y_coordinate      25079
latitude          25079
longitude         25079
location          25079
dtype: int64

#### Drop nan values and unnecessary columns
Every column was read in as a text (non-null object) column, so we need to convert datatypes before preprocessing.

A fair number of rows are missing location/lat/long.  This is a large enough portion of the dataset that we should look the values up rather than drop the rows.

The rows with na values for camera_id will have to be dropped, since we don't know which camera they belong to.

We will not be using x_coordinate and y_coordinate, so we drop those.  We will also drop location, since we already have lat/long in other columns.

In [18]:
#client_df.dropna(subset=['camera_id']).isna().sum()
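try:
    # wrapped in a try so that running the cell twice skips the drop.
    rlc_df.dropna(subset=['camera_id'], inplace=True)
    
    # drop xy coord and location columns
    # NOTE: index=1 also drops the row labeled 1, which is why index 1 is absent in the outputs below
    rlc_df = rlc_df.drop(columns=['x_coordinate', 'y_coordinate', 'location'], index=1)
except KeyError:
    # the columns (and row 1) are already gone from an earlier run
    pass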



rlc_df.isna().sum()

intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          25074
longitude         25074
dtype: int64
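
The try/except above exists only so the cell can be run twice. As a sketch on toy data (the frame and the `clean` helper are hypothetical), the same cleanup can be made repeatable with `errors="ignore"`, which skips labels that are already gone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rlc_df (values are illustrative only)
toy = pd.DataFrame({
    "camera_id": ["1523", np.nan, "1051"],
    "violations": [5, 2, 4],
    "x_coordinate": [1.0, 2.0, 3.0],
    "location": ["a", "b", "c"],
})


def clean(df):
    """Drop rows without a camera_id and the unused columns; safe to call repeatedly."""
    df = df.dropna(subset=["camera_id"])
    # errors="ignore" skips missing labels instead of raising KeyError
    return df.drop(columns=["x_coordinate", "y_coordinate", "location"], errors="ignore")


once = clean(toy)
twice = clean(once)  # second call is a no-op
```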

#### Manipulate datatypes for preprocessing. 

In [19]:
rlc_df['violations'] = rlc_df['violations'].astype(int)
rlc_df['latitude'] = rlc_df['latitude'].astype(float)
rlc_df['longitude'] = rlc_df['longitude'].astype(float)
rlc_df['violation_date'] = pd.to_datetime(rlc_df['violation_date'])
rlc_df['month'] = rlc_df['violation_date'].apply(lambda x: int(x.month))
rlc_df['day'] = rlc_df['violation_date'].apply(lambda x: int(x.day))  # fixed from dat to day!

rlc_df['weekday'] = rlc_df['violation_date'].apply(lambda x: int(datetime.weekday(x)))
rlc_df['year'] = rlc_df['violation_date'].apply(lambda x: int(x.year))

rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year
0,GRAND AND OAK PARK,1523,6800 W GRAND AVENUE,2016-01-02,5,41.923708,-87.795301,1,2,5,2016
2,RIDGE AND CLARK,1051,5930 N CLARK STREET,2016-01-02,4,41.989299,-87.670104,1,2,5,2016
3,SACRAMENTO AND CHICAGO,1812,800 N SACRAMENTO AVEN,2016-01-02,3,41.895641,-87.702352,1,2,5,2016
4,ARCHER AND CICERO,2084,5400 S ARCHER AVE,2016-01-02,3,41.798758,-87.743021,1,2,5,2016
5,WENTWORTH AND GARFIELD,2261,5500 S WENTWORTH AVEN,2016-01-02,19,,,1,2,5,2016
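
The date parts above are computed row by row with `apply`. A possibly faster alternative, shown here on a two-row toy frame rather than on rlc_df, is the vectorized `.dt` accessor. Depending on the pandas version the resulting integer dtype may differ from the output above.

```python
import pandas as pd

toy = pd.DataFrame({"violation_date": pd.to_datetime(["2016-01-02", "2017-11-30"])})

dates = toy["violation_date"].dt
toy["month"] = dates.month
toy["day"] = dates.day
toy["weekday"] = dates.weekday  # Monday=0 ... Sunday=6, same convention as datetime.weekday
toy["year"] = dates.year
```

Here 2016-01-02 is a Saturday, so its weekday is 5, matching the value in the table above.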


### Write the daily_violations TABLE - from rlc_df

In [20]:
make_table(rlc_df, 'daily_violations', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('cam_locations',), ('cam_startend',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('daily_violations',)]


## 3) Build cam_locations TABLE - from cam_locs AND cam_startend TABLE from cam_startend

#### Make a df with info for each camera
Will contain the following:
- camera_id
- location
- start date (when was the camera turned on)
- end date (when was the camera turned off)

In [21]:
cam_df = rlc_df.copy()
cam_df['start'] = cam_df['camera_id'].apply(lambda x: None)
cam_df['end'] = cam_df['camera_id'].apply(lambda x: None)

In [22]:
cam_start = cam_df.groupby(['camera_id'])['violation_date'].min().reset_index()
cam_end = cam_df.groupby(['camera_id'])['violation_date'].max().reset_index()

cam_startend = cam_start.copy()

#print(cam_end[cam_end['camera_id']=='1503'].values[0][1])  # for testing output
cam_startend['end'] = cam_start['camera_id'].apply(lambda x: cam_end[cam_end['camera_id']==x].values[0][1])

cam_startend.rename(columns={"violation_date": "start"}, inplace=True)
                                                   
print('NA values in cam_startend:', cam_startend.isna().sum(), end='\n\n', sep='\n')

print('Describe cam_startend:', cam_startend.describe(), end='\n\n', sep='\n')



NA values in cam_startend:
camera_id    0
start        0
end          0
dtype: int64

Describe cam_startend:
       camera_id                start                  end
count        316                  316                  316
unique       316                   19                   18
top         1491  2016-01-02 00:00:00  2021-01-19 00:00:00
freq           1                  250                  244
first        NaN  2016-01-02 00:00:00  2017-05-29 00:00:00
last         NaN  2018-03-05 00:00:00  2021-01-19 00:00:00
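
The start/end frame above is assembled from two separate groupby results and a per-row lookup. As a sketch on toy data, and assuming pandas 0.25 or later for named aggregation, both dates can be computed in one pass.

```python
import pandas as pd

# Toy stand-in for cam_df (dates are illustrative only)
toy = pd.DataFrame({
    "camera_id": ["1002", "1002", "1003", "1003", "1003"],
    "violation_date": pd.to_datetime(
        ["2016-01-02", "2021-01-19", "2016-03-01", "2018-06-10", "2017-05-29"]
    ),
})

# One row per camera with its first and last violation date
toy_startend = (
    toy.groupby("camera_id")["violation_date"]
       .agg(start="min", end="max")
       .reset_index()
)
```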



## Make a db table that has camera locations and intersections
Intersections are present (and addresses), but we do not have lat/long info for all cams

In [23]:
# we had some incorrect data in the code below, but have a creative fix.
cam_locs = rlc_df.groupby(['camera_id', 'intersection']).max().reset_index()
cam_locs.head()

# we find there is a mismatch between lens, one of them is duplicated
len(cam_locs)  # 317 total
len(cam_locs['camera_id'].unique()) # 316

cam_locs[cam_locs['camera_id'].duplicated()]  # 1421 is dupe
print('Two of them\n', cam_locs[cam_locs['camera_id'] == '1421'])  # we see two of them
print()

# Which one is it?
print('Damen/Diversey', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='DAMEN AND DIVERSEY')]['camera_id'].count())
print('Laramie/Fullerton:', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='LARAMIE AND FULLERTON')]['camera_id'].count())

# Turns out that a camera has two locations. One was only used one time.  We drop it.
cam_locs = cam_locs[(cam_locs['camera_id']!='1421') | (cam_locs['intersection']!='DAMEN AND DIVERSEY')]
print("Total cams", len(cam_locs))  # 316 total (got rid of the bad one)

Two of them
    camera_id           intersection                  address violation_date  \
75      1421     DAMEN AND DIVERSEY  2000 W DIVERSEY PARKWAY     2017-11-30   
76      1421  LARAMIE AND FULLERTON    2400 N LARAMIE AVENUE     2021-01-18   

    violations   latitude  longitude  month  day  weekday  year  
75           1  41.932394 -87.678173     11   30        3  2017  
76           6  41.924152 -87.756295     12   31        6  2021  

Damen/Diversey 1
Laramie/Fullerton: 1074
Total cams 316
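
The duplicate above was resolved by hand for camera 1421. If more cameras ever show up at two intersections, one possible general rule is to keep the intersection with the most records for each camera. This is a sketch of that rule on toy data, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for rlc_df: camera 1421 appears at two intersections
toy = pd.DataFrame({
    "camera_id": ["1421", "1421", "1421", "1002"],
    "intersection": ["LARAMIE AND FULLERTON", "LARAMIE AND FULLERTON",
                     "DAMEN AND DIVERSEY", "WESTERN AND CERMAK"],
})

# Count the records behind every (camera, intersection) pair
pair_counts = (
    toy.groupby(["camera_id", "intersection"]).size().reset_index(name="n_records")
)

# For each camera keep the pair with the most records
keep = pair_counts.loc[pair_counts.groupby("camera_id")["n_records"].idxmax()]
toy_cam_to_int = dict(zip(keep["camera_id"], keep["intersection"]))
```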


In [24]:
cam_locs.isna().sum()  # missing location for 17 cameras.  Let's fix it

camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          17
longitude         17
month              0
day                0
weekday            0
year               0
dtype: int64

Looks like we are also missing 17 of the 316 cam locations.  Let's look them up!

We have since changed this approach: we now take intersection locations from the maps instead.  The old code is left in place in case we ever need the exact location of cameras.

In [25]:

'''
This section goes through all of the rlc and assigns latlong
Many lights are missing it.  
For each light, there is an address though.
We use geocoding to get the latlong
'''

# let's get all of the red light cameras with their gps location.  
# This will aid in placing the accidents at rlc intersections later (if closer than threshold point to point)
# Some RLCs are missing location data,  but have addresses.  I can use geocoding I guess to look them up.


# geolocator = Nominatim(user_agent="https://github.com/sciencelee/chicago_rlc")  # please change to match repo
 
 
# def get_geocode(lat, long, address):
#     if lat > 0:  # it's a proper location, and assumed to be correct latlong
#         return (lat, long)
#     else: # no proper location on record for this cam, so geocode its address
#         if address in address_fix.keys(): address = address_fix[address]  # errors in the dataset chars omitted
#         location = geolocator.geocode(address + ', Chicago, IL')
#         if location is None:
#             print(address + ' : could not geolocate') # print it out if we can't find (address errors)
#         else:
#             return (location.latitude, location.longitude)

        


# Got it down to one-liner for this. Found out you can't extract and assign series like you can variables
#cam_locs['location'] = cam_locs.apply(lambda x: get_geocode(x.latitude, x.longitude, x.address), axis=1)

'\nThis section goes through all of the rlc and assigns latlong\nMany lights are missing it.  \nFor each light, there is an address though.\nWe use geocoding to get the latlong\n'

In [26]:
# cam_locs['location'].head()
# cam_locs['latitude'] = cam_locs['location'].apply(lambda x: x[0])
# cam_locs['longitude'] = cam_locs['location'].apply(lambda x: x[1])

# cam_locs = cam_locs.drop(columns=['violation_date', 'violations', 'month', 'weekday', 'year', 'location'])

In [27]:
cam_locs.info()
cam_locs.isna().sum() # Still missing lat/long for 17 cameras (the geocoding above is commented out)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 316 entries, 0 to 316
Data columns (total 11 columns):
camera_id         316 non-null object
intersection      316 non-null object
address           316 non-null object
violation_date    316 non-null datetime64[ns]
violations        316 non-null int64
latitude          299 non-null float64
longitude         299 non-null float64
month             316 non-null int64
day               316 non-null int64
weekday           316 non-null int64
year              316 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(3)
memory usage: 29.6+ KB


camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          17
longitude         17
month              0
day                0
weekday            0
year               0
dtype: int64

### Lat long fixes
During EDA, we found that five cameras had a completely wrong lat/long location.
Several others were located a little too far from the intersection to work properly.  When we rebuild the db, we will use a distance threshold larger than 30 m.

Here are the fixes I found easily.

In [28]:
cam_locs.head()

Unnamed: 0,camera_id,intersection,address,violation_date,violations,latitude,longitude,month,day,weekday,year
0,1002,WESTERN AND CERMAK,2200 S WESTERN AVENUE,2021-01-19,33,41.851984,-87.685786,12,31,6,2021
1,1003,WESTERN AND CERMAK,2400 W CERMAK ROAD,2021-01-19,9,41.852141,-87.685753,12,31,6,2021
2,1011,PETERSON AND WESTERN,6000 N WESTERN AVE,2021-01-19,28,41.990586,-87.689822,12,31,6,2021
3,1014,PETERSON AND WESTERN,2400 W PETERSON,2021-01-19,22,41.990609,-87.689735,12,31,6,2021
4,1023,IRVING PARK AND NARRAGANSETT,6400 W IRVING PK,2021-01-18,11,41.953025,-87.786683,12,31,6,2021


### Create cam_locations TABLE - from cam_locs AND cam_startend from cam_startend

In [29]:
make_table(cam_locs, 'cam_locations', c, conn)
make_table(cam_startend, 'cam_startend', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('cam_startend',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('daily_violations',), ('cam_locations',)]
[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('daily_violations',), ('cam_locations',), ('cam_startend',)]


### We still have missing lat/long info for our rlc_df.  Let's fix it
Now that we have cam_locs, we can fix rlc_df before moving on.

In [30]:
rlc_df.isna().sum()  


intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          25074
longitude         25074
month                 0
day                   0
weekday               0
year                  0
dtype: int64

## Decided to eliminate cam position in favor of intersection lat/long
I hope this makes all of my position data consistent for gathering crash info.
When using cam location, the cam position is sometimes up to 35 m up the road.  This would cause us to misidentify crashes from other intersections or miss some in the intersection of interest.

Remedy: Use center point of intersection for all cams.

How to do it:  I will change cam_locs data to match intersection instead of individual camera.

In [31]:
int_df.columns


Index(['protected_turn', 'total_lanes', 'medians', 'exit', 'split', 'way',
       'underpass', 'no_left', 'angled', 'triangle', 'one_way', 'turn_lanes',
       'lat', 'long', 'rlc', 'intersection', 'daily_traffic'],
      dtype='object')

In [32]:
def location_correction(int_df, intersect, latlong):
    # lookup function from intersection df to get the lat long
    # int_df is the intersection characteristic frame from 1) above
    # intersect is the intersection name used to link tables/df
    # latlong is either 'lat' or 'long'
    if latlong == 'lat':
        lat = int_df[int_df['intersection']==intersect]['lat'].values[0]
        if pd.isna(lat): print(lat, intersect)  # flag intersections with no latitude
        return lat
    else:
        long = int_df[int_df['intersection']==intersect]['long'].values[0]
        return long

cam_locs['latitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'lat'))
cam_locs['longitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'long'))

In [33]:
rlc_df.intersection.head()

0        GRAND AND OAK PARK
2           RIDGE AND CLARK
3    SACRAMENTO AND CHICAGO
4         ARCHER AND CICERO
5    WENTWORTH AND GARFIELD
Name: intersection, dtype: object

In [34]:
int_df[int_df['intersection']=='IRVING PARK AND KILPATRICK']

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
IRVING PARK AND KILPATRICK,1,6,0,0,0,4,0,0,0,0,1,3,41.953395,-87.744635,1,IRVING PARK AND KILPATRICK,37100


Make my cameras have a location that is center of intersection instead of exact cam location.

In [35]:
#⏳⏳⏳⏳⏳⏳
# THIS TAKES SOME TIME (8min on my macbook pro)
def read_loc(int_df, intersection):
    # return the (lat, long) of the intersection center from int_df
    cam = int_df[int_df['intersection']==intersection]
    return (float(cam['lat'].iloc[0]), float(cam['long'].iloc[0]))
        


# create a location column so we only have to do it once
rlc_df['location'] = rlc_df['intersection'].apply(lambda x: read_loc(int_df, x))
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year,location
0,GRAND AND OAK PARK,1523,6800 W GRAND AVENUE,2016-01-02,5,41.923708,-87.795301,1,2,5,2016,"(41.92359678755598, -87.79523664215334)"
2,RIDGE AND CLARK,1051,5930 N CLARK STREET,2016-01-02,4,41.989299,-87.670104,1,2,5,2016,"(41.98966718426217, -87.66998823909383)"
3,SACRAMENTO AND CHICAGO,1812,800 N SACRAMENTO AVEN,2016-01-02,3,41.895641,-87.702352,1,2,5,2016,"(41.89559271274954, -87.70223070169483)"
4,ARCHER AND CICERO,2084,5400 S ARCHER AVE,2016-01-02,3,41.798758,-87.743021,1,2,5,2016,"(41.798660621398106, -87.74286983575124)"
5,WENTWORTH AND GARFIELD,2261,5500 S WENTWORTH AVEN,2016-01-02,19,,,1,2,5,2016,"(41.794413041451094, -87.6305016931052)"
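
The row-wise lookup above filters int_df once for every violation record, which likely explains the long run time. A sketch of a vectorized alternative uses `Series.map` with an intersection-indexed lookup. It assumes intersection names are unique in the lookup frame, and the toy frames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for int_df and rlc_df (coordinates are illustrative only)
toy_int = pd.DataFrame({
    "intersection": ["A AND B", "C AND D"],
    "lat": [41.9, 41.8],
    "long": [-87.7, -87.6],
})
toy_rlc = pd.DataFrame({"intersection": ["C AND D", "A AND B", "C AND D"]})

# Index the lookup by intersection name, then map each record to its center point
centers = toy_int.set_index("intersection")
toy_rlc["latitude"] = toy_rlc["intersection"].map(centers["lat"])
toy_rlc["longitude"] = toy_rlc["intersection"].map(centers["long"])
```

Intersections that are absent from the lookup come back as NaN rather than raising, so a follow-up `isna()` check is still worthwhile.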


In [42]:
rlc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480708 entries, 0 to 480824
Data columns (total 11 columns):
intersection      480708 non-null object
camera_id         480708 non-null object
address           480708 non-null object
violation_date    480708 non-null datetime64[ns]
violations        480708 non-null int64
latitude          480708 non-null float64
longitude         480708 non-null float64
month             480708 non-null int64
day               480708 non-null int64
weekday           480708 non-null int64
year              480708 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(3)
memory usage: 44.0+ MB


In [44]:
# then add in the new lat longs to the df
rlc_df['latitude'] = rlc_df['location'].apply(lambda x: x[0])
rlc_df['longitude'] = rlc_df['location'].apply(lambda x: x[1])

In [45]:
rlc_df[rlc_df.latitude.isna()]['intersection'].unique()  # which intersections am I still missing

array([], dtype=object)

In [46]:
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year
0,GRAND AND OAK PARK,1523,6800 W GRAND AVENUE,2016-01-02,5,41.923597,-87.795237,1,2,5,2016
2,RIDGE AND CLARK,1051,5930 N CLARK STREET,2016-01-02,4,41.989667,-87.669988,1,2,5,2016
3,SACRAMENTO AND CHICAGO,1812,800 N SACRAMENTO AVEN,2016-01-02,3,41.895593,-87.702231,1,2,5,2016
4,ARCHER AND CICERO,2084,5400 S ARCHER AVE,2016-01-02,3,41.798661,-87.74287,1,2,5,2016
5,WENTWORTH AND GARFIELD,2261,5500 S WENTWORTH AVEN,2016-01-02,19,41.794413,-87.630502,1,2,5,2016


In [47]:
if 'location' in rlc_df.columns:
    rlc_df.drop(columns=['location'], inplace=True)

In [48]:
make_table(rlc_df, 'daily_violations', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',)]


## 4) Create intersection_cams TABLE - from int_cams

### We will now focus on trying to bring rlc intersections to our crashes
We find that we have 316 cameras at 159 intersections.

In [49]:
len(cam_locs.latitude.unique())

159

In [50]:
int_cams = cam_locs.groupby(['intersection']) \
                    .agg({'latitude':pd.Series.max, 'longitude':pd.Series.max,}) \
                    .reset_index()

int_cams['cam1'] = int_cams['intersection'] \
                            .apply(lambda x: cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[0])

int_cams['cam2'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])==1 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[1])

int_cams['cam3'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])<3 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[2])                             

int_cams.head()

Unnamed: 0,intersection,latitude,longitude,cam1,cam2,cam3
0,111TH AND HALSTED,41.692362,-87.642423,2422,2424,
1,115TH AND HALSTED,41.685089,-87.642094,2552,2553,
2,119TH AND HALSTED,41.677774,-87.64193,2402,2404,
3,31ST ST AND MARTIN LUTHER KING DRIVE,41.838441,-87.617338,2121,2123,
4,35TH AND WESTERN,41.830281,-87.684775,2091,2092,


In [51]:
print('Total Cameras', len(cam_locs))
print('Total Intersections', len(int_cams))

Total Cameras 316
Total Intersections 159


### Create intersection_cams TABLE - from int_cams
A region column for each intersection still needs to be added.  That has to wait until the congestion data is loaded, so it is done later in the notebook.

In [52]:
make_table(int_cams, 'intersection_cams', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',)]


## 5) Create signal_crashes TABLE - from crash_df AND all_crashes TABLE - from crash_df (pre-filter)

In [53]:
# Crash Data
crash_data = client.get("85ca-t3if", 
                     where="crash_date > \'2016-01-01T00:00:00.000\'",
                     limit=1000000,
                    )

crash_df = pd.DataFrame.from_records(crash_data) # Convert to pandas DataFrame

### Crash data preprocessing

In [54]:
# drop a few columns we don't need, including location (we have lat/long)
dropme = ['statements_taken_i', 'private_property_i', 'photos_taken_i', 'dooring_i', 'date_police_notified','location']

crash_df.drop(columns=dropme, inplace=True)

In [55]:
crash_df.isna().sum()

crash_record_id                       0
rd_no                              3289
crash_date                            0
posted_speed_limit                    0
traffic_control_device                0
device_condition                      0
weather_condition                     0
lighting_condition                    0
first_crash_type                      0
trafficway_type                       0
alignment                             0
roadway_surface_cond                  0
road_defect                           0
report_type                       11469
crash_type                            0
damage                                0
prim_contributory_cause               0
sec_contributory_cause                0
street_no                             0
street_direction                      3
street_name                           1
beat_of_occurrence                    5
num_units                             0
most_severe_injury                  951
injuries_total                      940


About 2.5k entries have no location.  Let's drop them.

In [56]:
crash_df.dropna(subset=['latitude',], inplace=True)  # get rid of na locations

### Let's look at what is in the data   

In [57]:
# What's in this data?
col_interest = ['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'hit_and_run_i', 'damage', 'prim_contributory_cause',
       'sec_contributory_cause', 'street_no', 'street_direction',
       'street_name', 'beat_of_occurrence', 'num_units', 'most_severe_injury', 
        'injuries_fatal', 'injuries_incapacitating',
       'injuries_non_incapacitating', 'injuries_reported_not_evident',
       'injuries_no_indication', 'injuries_unknown', 'crash_hour',
       'crash_day_of_week', 'crash_month', 'latitude', 'longitude', 'lane_cnt',
       'intersection_related_i', 'crash_date_est_i',
       'work_zone_i', 'work_zone_type',
       'workers_present_i']

for col in col_interest:
    print(col, crash_df[col].unique())

traffic_control_device ['NO CONTROLS' 'STOP SIGN/FLASHER' 'TRAFFIC SIGNAL' 'UNKNOWN'
 'OTHER REG. SIGN' 'LANE USE MARKING' 'DELINEATORS' 'POLICE/FLAGMAN'
 'RAILROAD CROSSING GATE' 'FLASHING CONTROL SIGNAL' 'SCHOOL ZONE'
 'OTHER RAILROAD CROSSING' 'RR CROSSING SIGN' 'NO PASSING'
 'BICYCLE CROSSING SIGN']
device_condition ['NO CONTROLS' 'FUNCTIONING PROPERLY' 'NOT FUNCTIONING' 'UNKNOWN' 'OTHER'
 'FUNCTIONING IMPROPERLY' 'WORN REFLECTIVE MATERIAL' 'MISSING']
weather_condition ['CLEAR' 'RAIN' 'UNKNOWN' 'SNOW' 'CLOUDY/OVERCAST' 'SLEET/HAIL'
 'FREEZING RAIN/DRIZZLE' 'FOG/SMOKE/HAZE' 'OTHER' 'BLOWING SNOW'
 'SEVERE CROSS WIND GATE' 'BLOWING SAND, SOIL, DIRT']
lighting_condition ['DAYLIGHT' 'DARKNESS' 'DARKNESS, LIGHTED ROAD' 'UNKNOWN' 'DAWN' 'DUSK']
first_crash_type ['TURNING' 'REAR END' 'PARKED MOTOR VEHICLE'
 'SIDESWIPE OPPOSITE DIRECTION' 'ANGLE' 'SIDESWIPE SAME DIRECTION'
 'OTHER OBJECT' 'HEAD ON' 'PEDESTRIAN' 'FIXED OBJECT' 'PEDALCYCLIST'
 'REAR TO FRONT' 'REAR TO SIDE' 'REAR TO REAR' 'A

### Filter for desired crashes (intersections with signal)
The unique values above show us which fields we can filter on.  
We can filter 'traffic_control_device' == 'TRAFFIC SIGNAL'.  
We can filter 'intersection_related_i' == 'Y'

This will leave us with only crashes that occurred at/because of intersections, and with a signal at the intersection.

intersection_related_i: A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.

In [58]:
make_table(crash_df, 'all_crashes', c, conn)

crash_df = crash_df[(crash_df['traffic_control_device']=='TRAFFIC SIGNAL') & \
                    (crash_df['intersection_related_i']=='Y')]

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',)]


### Add the intersection to my crashes

For each crash, we look up its lat/long and assign the matching intersection if we have one.
Looking up with geodesic distance equations took forever, and it was also slow using the Pythagorean theorem.
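
A minimal sketch of the bounding-box idea, using only the standard library. The 111,070 m per degree of latitude is the figure used in this notebook. The longitude figure is derived here from the common approximation of 111,320 · cos(latitude) meters per degree. The helper names (`m_per_deg_long`, `box_half_widths`, `in_box`) and the sample crash coordinates are illustrative assumptions, not part of the project code.

```python
import math

M_PER_DEG_LAT = 111_070  # meters per degree of latitude near Chicago (value used in this notebook)


def m_per_deg_long(lat_deg):
    """Approximate meters per degree of longitude at a given latitude."""
    return 111_320 * math.cos(math.radians(lat_deg))


def box_half_widths(box_side_m, lat_deg):
    """Half the box side, converted to degrees of latitude and longitude."""
    half = box_side_m / 2
    return half / M_PER_DEG_LAT, half / m_per_deg_long(lat_deg)


def in_box(crash, intersection, box_side_m=100):
    """True if the crash lies inside a square box centered on the intersection."""
    d_lat, d_long = box_half_widths(box_side_m, intersection[0])
    return (abs(crash[0] - intersection[0]) < d_lat
            and abs(crash[1] - intersection[1]) < d_long)


# Hypothetical coordinates for illustration only
intersection = (41.692362, -87.642423)
near_crash = (41.692500, -87.642300)   # roughly 15 m north, 10 m east
far_crash = (41.700000, -87.642423)    # roughly 850 m north

print(round(m_per_deg_long(41.88)))
print(in_box(near_crash, intersection), in_box(far_crash, intersection))
```

With a 100 m box side, a crash matches only if it is within 50 m of the intersection both north-south and east-west. The computed longitude scale at Chicago's latitude comes out close to the 83,000 m per degree used in the cell below.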

In [59]:
#⏳⏳⏳⏳⏳
# This takes too long to process.  Let's simplify it and make it a box instead.
box_side = 100  # box side in meters: a crash matches if it is within 50m north-south and east-west of the intersection
box_lat = box_side / 111070 / 2 # 111070 is meters in deg lat in Chicago
box_long = box_side / 83000 / 2 # 83000 is meters in deg long in Chicago

def box_check(lat, long, int_df):
    answer = (int_df[  (int_df['lat'] > (lat - box_lat)) & 
                      (int_df['lat'] < (lat + box_lat)) &
                      (int_df['long'] > (long - box_long)) &
                      (int_df['long'] < (long + box_long))
                     ])
    if answer.empty: return None
    return answer['intersection'].values[0]
    
# THIS SEEMS TO WORK WITH SPEED AND ELIMINATES MEMORY PROBLEM
crash_df['intersection'] = crash_df.apply(lambda x: box_check(float(x.latitude), 
                                                              float(x.longitude), 
                                                              int_df), axis=1)

In [60]:
# #Ex: df[['two', 'three']] = df[['two', 'three']].astype(float)
crash_df['crash_date'] = pd.to_datetime(crash_df['crash_date'])
crash_df['year'] = crash_df['crash_date'].apply(lambda x: int(x.year))
crash_df['month'] = crash_df['crash_date'].apply(lambda x: int(x.month))
crash_df['day'] = crash_df['crash_date'].apply(lambda x: int(x.day))
crash_df['hour'] = crash_df['crash_date'].apply(lambda x: int(x.hour))

In [61]:
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60158 entries, 5 to 463869
Data columns (total 48 columns):
crash_record_id                  60158 non-null object
rd_no                            59698 non-null object
crash_date                       60158 non-null datetime64[ns]
posted_speed_limit               60158 non-null object
traffic_control_device           60158 non-null object
device_condition                 60158 non-null object
weather_condition                60158 non-null object
lighting_condition               60158 non-null object
first_crash_type                 60158 non-null object
trafficway_type                  60158 non-null object
alignment                        60158 non-null object
roadway_surface_cond             60158 non-null object
road_defect                      60158 non-null object
report_type                      58343 non-null object
crash_type                       60158 non-null object
damage                           60158 non-null object
pr

### Create signal_crashes TABLE - from crash_df

In [62]:
make_table(crash_df, 'signal_crashes', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',)]


## 6) Create hourly_congestion TABLE from all_traffic
For this one, we have to combine two different datasets.  Chicago changed the way data was recorded in 2018.  The columns are similar, but the newer dataset collects more data.

In [63]:
# Congestion Data
traffic_df = client.get("emtn-qqdi", 
                     #where="TIME > \'2015-01-01T00:00:00.000\'",
                     where='TIME > \'2016-01-01T00:00:00.000\'',
                     limit=10000000,
                    )

traffic_df = pd.DataFrame.from_records(traffic_df) # Convert to pandas DataFrame

### Clean up my datatypes before preprocessing
We won't be able to build the table until we have both datasets.

In [64]:
traffic_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df['time'] = pd.to_datetime(traffic_df['time'])
traffic_df['bus_count'] = traffic_df['bus_count'].astype(int)
traffic_df['num_reads'] = traffic_df['num_reads'].astype(int)
traffic_df['speed'] = traffic_df['speed'].astype(float)

### On to the other dataset

In [65]:
# Congestion data from later
traffic_df2 = client.get("kf7e-cur8", #2018 to present
                     select='time, region_id, speed, bus_count, num_reads',  # this set is huge, so we won't get all       
                     where="TIME < \'2021-01-01T00:00:00.000\'",
                     limit=10000000,
                    )

# Convert to pandas DataFrame
traffic_df2 = pd.DataFrame.from_records(traffic_df2)


In [66]:
#traffic2_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df2['time'] = pd.to_datetime(traffic_df2['time'])
traffic_df2['bus_count'] = traffic_df2['bus_count'].astype(int)
traffic_df2['num_reads'] = traffic_df2['num_reads'].astype(int)
traffic_df2['speed'] = traffic_df2['speed'].astype(float)

## Now get the congestion data processed
We have two separate traffic dataframes: data before 2018 and data from 2018 onward come from two different API endpoints.
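
Since both frames share the same five columns, stacking their rows is one way to combine them. A minimal sketch with `pd.concat` on two tiny made-up frames (the values are illustrative, not taken from the data portal), followed by averaging down to one row per region per hour:

```python
import pandas as pd

# Tiny made-up stand-ins for the two endpoint results (same columns in both)
old = pd.DataFrame({
    'time': pd.to_datetime(['2017-12-31 23:10', '2017-12-31 23:40']),
    'region_id': ['1', '1'],
    'speed': [20.0, 24.0],
    'bus_count': [10, 14],
    'num_reads': [100, 140],
})
new = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:10']),
    'region_id': ['1'],
    'speed': [30.0],
    'bus_count': [5],
    'num_reads': [60],
})

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index
combined = pd.concat([old, new], ignore_index=True)

# Split each timestamp into a calendar date and an hour of day
combined['date'] = combined['time'].dt.date
combined['hour'] = combined['time'].dt.hour

# Average the measurements within each date, hour and region
hourly = (combined
          .groupby(['date', 'hour', 'region_id'])[['speed', 'bus_count', 'num_reads']]
          .mean()
          .reset_index())

print(hourly)
```

Selecting the numeric columns before `.mean()` keeps the aggregation explicit about which columns are averaged.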


In [67]:
traffic_df.head()
traffic_df2.head()
traffic_df2.info()
print()
traffic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3959185 entries, 0 to 3959184
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
speed        float64
bus_count    int64
num_reads    int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 151.0+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3507463 entries, 0 to 3507462
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
bus_count    int64
num_reads    int64
speed        float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 133.8+ MB


In [68]:
# Merge my two data sets for congestion by region
all_traffic = pd.merge(traffic_df, traffic_df2, how='outer')
print('traffic dfs merged')

traffic dfs merged


In [69]:
all_traffic['hour'] = all_traffic['time'].dt.hour
print('added hour column')

all_traffic['day'] = all_traffic.time.dt.day
print('added day column')

all_traffic['month'] = all_traffic.time.dt.month
print('added month column')

all_traffic['year'] = all_traffic.time.dt.year
print('added year column')

all_traffic['weekday'] = all_traffic.time.dt.weekday
print('added weekday column')

added hour column
added day column
added month column
added year column
added weekday column


In [70]:
print(len(all_traffic))  # lots of dupes 
all_traffic = all_traffic.groupby(['year', 'month', 'day', 'hour', 'region_id']).mean().reset_index()
all_traffic.info()

7256572
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1242940 entries, 0 to 1242939
Data columns (total 9 columns):
year         1242940 non-null int64
month        1242940 non-null int64
day          1242940 non-null int64
hour         1242940 non-null int64
region_id    1242940 non-null object
bus_count    1242940 non-null float64
num_reads    1242940 non-null float64
speed        1242940 non-null float64
weekday      1242940 non-null float64
dtypes: float64(4), int64(4), object(1)
memory usage: 85.3+ MB


In [71]:
all_traffic.head()

Unnamed: 0,year,month,day,hour,region_id,bus_count,num_reads,speed,weekday
0,2016,1,1,0,1,6.8,120.2,27.062,4.0
1,2016,1,1,0,10,34.0,390.8,24.834,4.0
2,2016,1,1,0,11,16.2,265.8,26.004,4.0
3,2016,1,1,0,12,25.0,350.8,16.526,4.0
4,2016,1,1,0,13,33.6,511.4,18.136,4.0


In [72]:
# couple minutes
make_table(all_traffic, 'hourly_congestion', c, conn)

[('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


# Speed fix for congestion
Congestion is measured by average bus speed.

The problem:
- Overnight (between 11pm and 5am) we have few bus routes.
- Some regions have no buses overnight
- Some regions have only a few buses 
- Some buses are ending routes and have only a few reads
- Some buses are stationary (next morning staging)

The fix:
- replace speed for few buses/reads if speed is low
- we assume low buses/reads to be overnight when congestion is minimal
- replacement speed is a low congestion quantile speed (90% or so)
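
The fix described in the list above can also be expressed with whole-column operations instead of a row-by-row function. A minimal sketch on made-up data, assuming thresholds of 5 buses, 100 reads and a speed of 25 (illustrative values) and a hypothetical output column `speed_fixed`:

```python
import pandas as pd

# Made-up hourly rows for two regions (values are illustrative only)
traffic = pd.DataFrame({
    'region_id': ['1', '1', '1', '2', '2', '2'],
    'bus_count': [40, 35, 2, 50, 45, 1],
    'num_reads': [600, 500, 20, 700, 650, 10],
    'speed':     [20.0, 24.0, 0.0, 18.0, 22.0, 3.0],
})

# Per-region 0.9 quantile, broadcast back onto every row of that region
q90 = traffic.groupby('region_id')['speed'].transform(lambda s: s.quantile(0.9))

# Assumed thresholds: few buses or few reads, combined with a low speed
sparse = (traffic['bus_count'] <= 5) | (traffic['num_reads'] < 100)
slow = traffic['speed'] < 25

# Replace only the flagged rows; all other speeds are kept unchanged
traffic['speed_fixed'] = traffic['speed'].mask(sparse & slow, q90)

print(traffic)
```

Here `transform` returns a Series aligned with the original rows, so no per-row lookup of the region's quantile is needed.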

In [73]:
## Let's get the 0.90 quantile for every region, and then use that to fill in missing data

regions_90 = all_traffic.groupby(['region_id'])['speed'].quantile(0.9).reset_index()


In [74]:
regions_90.head()


Unnamed: 0,region_id,speed
0,1,25.068333
1,10,26.4534
2,11,26.808033
3,12,23.136667
4,13,24.143667


In [75]:
#### 5 MINUTES OR SO

# My read on this is that few buses run 24/7, so the overnight data is unreliable.
# Buses stage for the next morning.  You can see them all along Clark, LSD etc.
# They have speed=0 and may be recording.  Could talk to the owner of the dataset.

# Cutoff: 5 or fewer buses, or fewer than 100 reads, combined with speed < 25.
# Any speed > 40 is also replaced, regardless of bus count or reads.
# In those cases I put in the 0.9 quantile speed for the region.


def speed_check(bus, speed, reads, region_id, regions_90):
    # parentheses make the evaluation order explicit (`and` binds tighter than `or`)
    if ((bus <= 5 or reads < 100) and speed < 25) or speed > 40:
        return regions_90[regions_90['region_id']==region_id]['speed'].values[0]
    else:
        return speed
    

# apply is SLOOOOOOWWW, but not sure how else to accomplish this without iter
all_traffic['speed'] = all_traffic.apply(lambda x: speed_check(x.bus_count, x.speed, x.num_reads, x.region_id, regions_90), axis=1)
      


In [76]:
make_table(all_traffic, 'hourly_congestion', c, conn)

[('hourly_weather',), ('region_data',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


## 7) Create hourly_weather from wx_df

In [77]:
# Import weather data from csv
wx_df = pd.read_csv('data/chi_wx.csv')

In [78]:
wx_df.head()

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,...,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1420070400,2015-01-01 00:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,265.96,258.16,264.85,267.708,...,230,,,,,1,800,Clear,sky is clear,01n
1,1420074000,2015-01-01 01:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.13,256.52,265.35,267.926,...,230,,,,,20,801,Clouds,few clouds,02n
2,1420077600,2015-01-01 02:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.17,257.7,265.35,268.098,...,230,,,,,20,801,Clouds,few clouds,02n
3,1420081200,2015-01-01 03:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.39,257.56,265.35,268.157,...,240,,,,,1,800,Clear,sky is clear,01n
4,1420084800,2015-01-01 04:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.47,256.5,265.35,268.121,...,240,,,,,1,800,Clear,sky is clear,01n


In [79]:
wx_df['time'] = pd.to_datetime(wx_df['dt_iso'].apply(lambda x: x[:-4]))
wx_df.head()
wx_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55219 entries, 0 to 55218
Data columns (total 26 columns):
dt                     55219 non-null int64
dt_iso                 55219 non-null object
timezone               55219 non-null int64
city_name              55219 non-null object
lat                    55219 non-null float64
lon                    55219 non-null float64
temp                   55219 non-null float64
feels_like             55219 non-null float64
temp_min               55219 non-null float64
temp_max               55219 non-null float64
pressure               55219 non-null int64
sea_level              0 non-null float64
grnd_level             0 non-null float64
humidity               55219 non-null int64
wind_speed             55219 non-null float64
wind_deg               55219 non-null int64
rain_1h                6587 non-null float64
rain_3h                816 non-null float64
snow_1h                1538 non-null float64
snow_3h                91 non-null float6

In [80]:
wx_df['rain_3h'] = wx_df['rain_3h'].fillna(0)
wx_df['rain_1h'] = wx_df['rain_1h'].fillna(0)
wx_df['snow_3h'] = wx_df['snow_3h'].fillna(0)
wx_df['snow_1h'] = wx_df['snow_1h'].fillna(0)
wx_df['temp'] = wx_df['temp_max']  # use the hourly max as the temperature
wx_df['year'] = wx_df.time.dt.year
wx_df['month'] = wx_df.time.dt.month
wx_df['day'] = wx_df.time.dt.day
wx_df['hour'] = wx_df.time.dt.hour
wx_df['weekday'] = wx_df.time.dt.weekday

In [81]:
wx_df.describe()

Unnamed: 0,dt,timezone,lat,lon,temp,feels_like,temp_min,temp_max,pressure,sea_level,...,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,year,month,day,hour,weekday
count,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,0.0,...,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0
mean,1513951000.0,-19252.525399,41.87811,-87.6298,285.518514,280.257647,281.898104,285.518514,1016.055542,,...,0.047967,0.013533,0.001809,61.031294,750.721907,2017.485087,6.395969,15.782557,11.455894,2.998443
std,53882850.0,1714.737534,7.105492e-15,4.263295e-14,11.064564,13.182997,10.354815,11.064564,7.584028,,...,0.708266,0.123137,0.066962,32.04714,112.093082,1.695629,3.444176,8.817962,6.908119,1.999746
min,1420070000.0,-21600.0,41.87811,-87.6298,245.37,233.18,242.15,245.37,965.0,,...,0.0,0.0,0.0,0.0,200.0,2015.0,1.0,1.0,0.0,0.0
25%,1467211000.0,-21600.0,41.87811,-87.6298,276.48,269.89,274.15,276.48,1011.0,,...,0.0,0.0,0.0,40.0,800.0,2016.0,3.0,8.0,5.0,1.0
50%,1514524000.0,-18000.0,41.87811,-87.6298,285.193,279.25,281.161,285.193,1016.0,,...,0.0,0.0,0.0,75.0,802.0,2017.0,6.0,16.0,11.0,3.0
75%,1560530000.0,-18000.0,41.87811,-87.6298,295.15,291.9,290.95,295.15,1021.0,,...,0.0,0.0,0.0,90.0,803.0,2019.0,9.0,23.0,17.0,5.0
max,1606864000.0,-18000.0,41.87811,-87.6298,311.48,309.89,306.132,311.48,1044.0,,...,35.0,8.4,6.0,100.0,804.0,2020.0,12.0,31.0,23.0,6.0


In [82]:
try:
    wx_df = wx_df.drop(columns=['dt', 
                        'dt_iso', 
                        'timezone', 
                        'city_name', 
                        'lat', 
                        'lon', 
                        'feels_like', 
                        'temp_min', 
                        'temp_max',
                        'pressure',
                        'sea_level',
                        'grnd_level',
                        'humidity',
                        'wind_speed',
                        'wind_deg',
                        'clouds_all',
                        'weather_description',
                        'weather_icon',
                        'weather_id',
                        'weather_main',
                       ], axis=1)
except KeyError:  # raised if the columns were already dropped on an earlier run
    print('Failed')

In [83]:
print(len(wx_df))
print(wx_df.duplicated().sum())


print('Total hours in 6 years:', 365.25 * 24 * 6)
print('Unique entries:', len(wx_df.drop_duplicates()))  
# missing a few entries (700+ out of 52k)  Am I missing a month??

print()
print(wx_df.time.min(), wx_df.time.max())  # OH!!!!  I am missing the last month
print('Total hours in 6 years (-1 mos):', 365.25 * 24 * 6 - 31 * 24)  # okay, we are only missing a few


wx_df.drop_duplicates(inplace=True)

55219
3331
Total hours in 6 years: 52596.0
Unique entries: 51888

2015-01-01 00:00:00+00:00 2020-12-01 23:00:00+00:00
Total hours in 6 years (-1 mos): 51852.0
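
The estimate above uses 365.25-day years and a 31-day month, so it is only approximate. A short sketch that counts the hourly slots exactly between the first and last timestamps printed above, using only the standard library:

```python
from datetime import datetime, timedelta

# First and last timestamps reported by the cell above
first = datetime(2015, 1, 1, 0)
last = datetime(2020, 12, 1, 23)

# Number of hourly slots from first to last, counting both endpoints
expected_hours = int((last - first) / timedelta(hours=1)) + 1
print(expected_hours)
```

This gives 51,888, the same as the unique-entry count printed above. That suggests few if any hours are missing, provided each remaining row has a distinct timestamp.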


In [84]:
make_table(wx_df, 'hourly_weather', c, conn)

[('region_data',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',)]


## 8) Create TABLE region_data from region_df

In [85]:
# This time we only grab what we need

region_df = client.get("kf7e-cur8", # regional congestion current data
                         select='region_id, region, description, north, south, east, west',
                         limit=1000
                    )

# Convert to pandas DataFrame
region_df = pd.DataFrame.from_records(region_df)  # should only return most recent for each region

In [86]:
region_df = region_df.groupby('region_id').max().reset_index()

In [87]:
# need these as floats so we can compare them
region_df[['north', 'south', 'east', 'west']] = region_df[['north', 'south', 'east', 'west']].astype(float)
region_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 7 columns):
region_id      29 non-null object
region         29 non-null object
description    29 non-null object
north          29 non-null float64
south          29 non-null float64
east           29 non-null float64
west           29 non-null float64
dtypes: float64(4), object(3)
memory usage: 1.7+ KB


### Add region to my crash df

In [88]:
crash_df[['latitude', 'longitude']] = crash_df[['latitude', 'longitude']].astype(float)
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60158 entries, 5 to 463869
Data columns (total 48 columns):
crash_record_id                  60158 non-null object
rd_no                            59698 non-null object
crash_date                       60158 non-null datetime64[ns]
posted_speed_limit               60158 non-null object
traffic_control_device           60158 non-null object
device_condition                 60158 non-null object
weather_condition                60158 non-null object
lighting_condition               60158 non-null object
first_crash_type                 60158 non-null object
trafficway_type                  60158 non-null object
alignment                        60158 non-null object
roadway_surface_cond             60158 non-null object
road_defect                      60158 non-null object
report_type                      58343 non-null object
crash_type                       60158 non-null object
damage                           60158 non-null object
pr

In [89]:
# add in the region for my crashes
# Resource hog
crash_df.columns


def which_region(lat, long, region_df):
    #print(lat, long)
    row = region_df[(region_df['east'] >= long) &
                    (region_df['west'] < long) &
                    (region_df['north'] >= lat) &
                    (region_df['south'] < lat)]['region_id'].max()
    return row

#df.iloc[:5]
# takes about 5 minutes
crash_df['region_id'] = crash_df.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)

In [90]:
len(crash_df)
crash_df.columns

crash_df['time'] = pd.to_datetime(crash_df.crash_date)
crash_df['year'] = crash_df.time.dt.year
crash_df['month'] = crash_df.time.dt.month
crash_df['day'] = crash_df.time.dt.day
crash_df['hour'] = crash_df.time.dt.hour
crash_df['weekday'] = crash_df.time.dt.weekday

In [91]:
make_table(region_df, 'region_data', c, conn)
print()
make_table(crash_df, 'signal_crashes', c, conn)  # also update my crash data

[('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',)]

[('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',)]


### Add region_id to intersection_cams
While I'm here and have the function ready, I would like to add the region_id to each red light camera intersection (int_cams).
The region will help me link the daily_violations and hourly_congestion TABLEs.

*** NOTE: It makes more sense to put the region into the intersection_cams table than into daily_violations, to speed this up.

In [92]:
# about 1 minute
#rlc['region_id'] = 
int_cams['region_id'] = int_cams.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)


In [93]:
## commit my change
make_table(int_cams, 'intersection_cams', c, conn)

[('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',), ('intersection_cams',)]


### Use this code to test any of your tables for proper data storage

In [94]:
query = c.execute("SELECT camera_id, violations FROM daily_violations;").fetchall()
print(query[:5])
print(len(query))

[('1523', 5), ('1051', 4), ('1812', 3), ('2084', 3), ('2261', 19)]
480708
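
The same kind of check can be run across every table at once. A minimal sketch with the standard `sqlite3` module, using an in-memory database and a made-up table as a stand-in for `database/rlc2.db` (the names `conn_demo` and `counts` are illustrative):

```python
import sqlite3

# In-memory database with a made-up table, standing in for database/rlc2.db
conn_demo = sqlite3.connect(':memory:')
cur = conn_demo.cursor()
cur.execute("CREATE TABLE daily_violations (camera_id TEXT, violations INTEGER)")
cur.executemany("INSERT INTO daily_violations VALUES (?, ?)",
                [('1523', 5), ('1051', 4), ('1812', 3)])
conn_demo.commit()

# List every user table, then count its rows
tables = [row[0] for row in
          cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
counts = {}
for name in tables:
    # Table names cannot be bound as parameters, so quote the identifier
    counts[name] = cur.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]

print(counts)
conn_demo.close()
```

Comparing each count against `len()` of the dataframe that was written is a quick way to confirm nothing was lost on the way into the database.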


## Before I go, I want to add intersections to my crashes to link the db tables

In [95]:
sql_fetch_tables(c, conn)

[('all_hours',),
 ('int_startend',),
 ('intersection_chars',),
 ('cam_locations',),
 ('cam_startend',),
 ('daily_violations',),
 ('all_crashes',),
 ('hourly_congestion',),
 ('hourly_weather',),
 ('region_data',),
 ('signal_crashes',),
 ('intersection_cams',)]

In [96]:
df = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
camloc_df = pd.read_sql_query('SELECT * FROM cam_locations', conn)
ints_df = pd.read_sql_query('SELECT * FROM intersection_cams', conn)


In [97]:
#ints_df.astype({'longitude':float})
pd.options.display.max_rows = 200


In [98]:
# Now I am desperate.  This takes too long to process.  Let's simplify it and make it a box instead.
box_side = 100  # box side in meters: a crash matches if it is within 50m north-south and east-west of the intersection
box_lat = box_side / 111070 / 2 # 111070 is meters in deg lat in Chicago
box_long = box_side / 83000 / 2 # 83000 is meters in deg long in Chicago

def box_check(lat, long, int_df):
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    # print('n', n, 's', s, 'e', e, 'w', w, 'lat:', lat, 'long:', long)
    answer = int_df[  (int_df['lat'] > s) &
                      (int_df['lat'] < n) &
                      (int_df['long'] > w) &
                      (int_df['long'] < e)
                      
                     ]
    if answer.empty: return None
    return answer['intersection'].values[0]

    
# THIS SEEMS TO WORK AT SPEED AND ELIMINATES MEMORY PROBLEM
for i in range(5000, 5200): 
    lat = float(df.iloc[i]['latitude'])
    long = float(df.iloc[i]['longitude'])
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    answer = int_df[  (int_df['lat'] > s) &
                      (int_df['lat'] < n) &
                      (int_df['long'] > w) &
                      (int_df['long'] < e)]['intersection'].values
    if len(answer): print(answer[0])
    
# 99th Halsted: 41.714230	-87.643043
# MOMENT OF TRUTH
df['intersection'] = df.apply(lambda x: box_check(float(x.latitude), float(x.longitude), int_df), axis=1)



IRVING PARK AND KEDZIE
CICERO AND CHICAGO
STATE AND 79TH
ASHLAND AND 63RD
IRVING PARK AND NARRAGANSETT
STATE AND 79TH
KEDZIE AND 71ST
BROADWAY/SHERIDAN AND DEVON
ASHLAND AND IRVING PARK
STONEY ISLAND AND 76TH
GRAND AND OAK PARK
111TH AND HALSTED
PETERSON AND WESTERN
CLARK AND FULLERTON
CICERO AND FULLERTON
KIMBALL AND DIVERSEY
CICERO AND WASHINGTON
KOSTNER AND NORTH
COLUMBUS AND ILLINOIS
WESTERN AND ARMITAGE
MADISON AND CENTRAL
BELMONT AND KEDZIE
HALSTED AND FULLERTON
RIDGE AND CLARK
NORTHWEST HIGHWAY AND FOSTER
55TH and PULASKI
ELSTON AND IRVING PARK
ROOSEVELT AND PULASKI
WESTERN AND NORTH
BELMONT AND KEDZIE


In [99]:
df.intersection.count() / len(df)
len(df.intersection.unique())

182

In [100]:
make_table(df, 'signal_crashes', c, conn)


[('all_hours',), ('int_startend',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',)]


### Make a table that has every date for every intersection.  This will help out my queries

In [95]:
int_chars = pd.read_sql_query("SELECT * FROM intersection_chars", conn)
wx_df = pd.read_sql_query("SELECT * FROM hourly_weather", conn)

In [96]:
my_ints = int_chars.intersection

In [99]:
grouped_dates = wx_df.groupby(['year', 'month', 'day']).sum().reset_index()[['year', 'month', 'day']]
grouped_dates = grouped_dates[grouped_dates['year']>2015]

In [100]:
# didn't work 
grouped_dates.head()
grouped_dates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1797 entries, 365 to 2161
Data columns (total 3 columns):
year     1797 non-null int64
month    1797 non-null int64
day      1797 non-null int64
dtypes: int64(3)
memory usage: 56.2 KB


In [105]:
# Should only have to do this once
#for i in range(len(wx_df)):
big_df = grouped_dates.copy()
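# NOTE: this assigns a whole row (a Series indexed by column name), which does not align with
# big_df's integer index, so the column is all NaN for this first copy and those rows are dropped
# below.  Use int_chars.iloc[0]['intersection'] to include the first intersection.
big_df['intersection'] = int_chars.iloc[0,:]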

for i in range(1, len(int_chars)):
    print(int_chars.iloc[i, :]['intersection'])
    df = grouped_dates.copy()
    df['intersection'] = int_chars.iloc[i,:]['intersection']
    big_df = pd.concat([big_df, df])



115TH AND HALSTED
119TH AND HALSTED
31ST AND CALIFORNIA
31ST ST AND MARTIN LUTHER KING DRIVE
35TH AND WESTERN
4700 WESTERN
55TH AND KEDZIE
55TH AND WESTERN
55TH and PULASKI
63RD AND STATE
71ST AND ASHLAND
75TH AND STATE
79TH AND HALSTED
79TH AND KEDZIE
83RD AND STONY ISLAND
87TH AND VINCENNES
95TH AND STONEY ISLAND
99TH AND HALSTED
ADDISON AND HARLEM
ARCHER AND CICERO
ARCHER/NARRAGANSETT AND 55TH
ASHLAND AND 47TH
ASHLAND AND 63RD
ASHLAND AND 87TH
ASHLAND AND 95TH
ASHLAND AND ARCHER
ASHLAND AND DIVERSEY
ASHLAND AND DIVISION
ASHLAND AND FULLERTON
ASHLAND AND IRVING PARK
ASHLAND AND LAWRENCE
ASHLAND AND MADISON
AUSTIN AND ADDISON
AUSTIN AND IRVING PARK
BELMONT AND KEDZIE
BLUE ISLAND AND DAMEN
BROADWAY/SHERIDAN AND DEVON
CALIFORNIA AND DEVON
CALIFORNIA AND DIVERSEY
CALIFORNIA AND PETERSON
CANAL AND ROOSEVELT
CENTRAL AND ADDISON
CENTRAL AND BELMONT
CENTRAL AND CHICAGO
CENTRAL AND DIVERSEY
CENTRAL AND FULLERTON
CENTRAL AND IRVING PARK
CENTRAL AND LAKE
CENTRAL AND MILWAUKEE
CERMAK AND PULASKI
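The loop above pairs every date with every intersection, which is a cross join. A minimal sketch of the same idea in one step, assuming pandas 1.2 or later for `how="cross"`; the `dates` and `ints` frames are small hypothetical stand-ins for `grouped_dates` and `int_chars`, not the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for grouped_dates and int_chars
dates = pd.DataFrame({"year": [2016, 2016], "month": [1, 1], "day": [1, 2]})
ints = pd.DataFrame({
    "intersection": ["111TH AND HALSTED", "115TH AND HALSTED", "119TH AND HALSTED"]
})

# how="cross" pairs every date row with every intersection row
all_days = dates.merge(ints[["intersection"]], how="cross")

print(len(all_days))                          # rows = dates * intersections
print(all_days["intersection"].isna().sum())  # no NaN rows to drop afterwards
```

Because every intersection name comes straight from the merge, no copy ends up with missing names and the first intersection is not lost.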

In [106]:
big_df.intersection.isna().sum()

1797

In [107]:
big_df.intersection.unique()
big_df.dropna(subset=['intersection'], inplace=True)


In [108]:
big_df.head()

Unnamed: 0,year,month,day,intersection
365,2016,1,1,115TH AND HALSTED
366,2016,1,2,115TH AND HALSTED
367,2016,1,3,115TH AND HALSTED
368,2016,1,4,115TH AND HALSTED
369,2016,1,5,115TH AND HALSTED


In [109]:
pd.options.display.max_rows = 500


In [110]:
# rough size check: days per year * years * intersections
print(365.25 * 6 * 153)
make_table(big_df, 'all_hours', c, conn)

335299.5
[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('int_startend',), ('intersection_chars',), ('all_hours',)]


## FUTURE WORK? traffic_count TABLE from traffic_count
Congestion did not work in the model. It is reported by region, and the regions added very little.
We have data from a traffic study that gives the average traffic volume by street segment.
We will try to match the segment(s) to the cameras, which might be tricky.

Problem: I have start/end dates for each of my cams,
but not for each intersection, and all of my data is now organized by intersection.

In [112]:
cam_startend
int_cams.head()

Unnamed: 0,intersection,latitude,longitude,cam1,cam2,cam3,region_id
0,111TH AND HALSTED,41.692362,-87.642423,2422,2424,,26
1,115TH AND HALSTED,41.685089,-87.642094,2552,2553,,26
2,119TH AND HALSTED,41.677774,-87.64193,2402,2404,,26
3,31ST ST AND MARTIN LUTHER KING DRIVE,41.838441,-87.617338,2121,2123,,16
4,35TH AND WESTERN,41.830281,-87.684775,2091,2092,,15


In [113]:
def find_int(cam_id, int_cams):
    '''returns the intersection associated with the red light camera'''
    my_int = int_cams[(int_cams['cam1']==cam_id) |
                 (int_cams['cam2']==cam_id) |
                 (int_cams['cam3']==cam_id)
                    ]['intersection'].max()
    return my_int

int_startend = cam_startend.copy()
int_startend.head()
int_startend['intersection'] = int_startend['camera_id'].apply(lambda x: find_int(x, int_cams))
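The per-camera lookup above can also be done by reshaping the cam1/cam2/cam3 columns into a single camera-to-intersection mapping. This is only a sketch on hypothetical miniature tables (`int_cams_demo`, `cam_startend_demo`); it assumes camera ids are numeric and that dates are ISO-formatted strings, which sort correctly as text.

```python
import pandas as pd

# Hypothetical miniature versions of int_cams and cam_startend
int_cams_demo = pd.DataFrame({
    "intersection": ["111TH AND HALSTED", "115TH AND HALSTED"],
    "cam1": [2422, 2552],
    "cam2": [2424, 2553],
    "cam3": [float("nan"), float("nan")],
})
cam_startend_demo = pd.DataFrame({
    "camera_id": [2422, 2424, 2552, 2553],
    "start": ["2016-01-02", "2016-03-01", "2016-01-02", "2016-01-05"],
    "end": ["2021-01-19", "2020-12-31", "2017-10-26", "2017-10-20"],
})

# Reshape cam1/cam2/cam3 into one camera_id column and drop empty slots
cam_to_int = (
    int_cams_demo
    .melt(id_vars="intersection", value_vars=["cam1", "cam2", "cam3"],
          value_name="camera_id")
    .dropna(subset=["camera_id"])
    .astype({"camera_id": int})
    .set_index("camera_id")["intersection"]
)

# Attach the intersection to each camera, then take the widest date range
demo = cam_startend_demo.assign(
    intersection=cam_startend_demo["camera_id"].map(cam_to_int)
)
int_startend_demo = (
    demo.groupby("intersection").agg({"start": "min", "end": "max"}).reset_index()
)
print(int_startend_demo)
```

The mapping is built once, so each camera is matched without filtering the whole intersection table again.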

In [114]:
int_startend = int_startend.groupby('intersection').agg({'start':'min', 'end':'max'}).reset_index()

In [115]:
int_startend.head()

Unnamed: 0,intersection,start,end
0,111TH AND HALSTED,2016-01-02,2021-01-19
1,115TH AND HALSTED,2016-01-02,2017-10-26
2,119TH AND HALSTED,2016-01-02,2021-01-19
3,31ST ST AND MARTIN LUTHER KING DRIVE,2016-01-02,2021-01-19
4,35TH AND WESTERN,2016-01-02,2021-01-19


In [116]:
make_table(int_startend, 'int_startend', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('all_hours',), ('int_startend',)]


# Add speed to intersection_chars
Use the posted speed limit from crashes at each intersection to add a speed column to my intersection characteristics. The original idea was the average or mode; the code below takes the maximum.

In [78]:
speed_df = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
int_char_df = pd.read_sql_query("SELECT * FROM intersection_chars", conn)


In [79]:
speed_df.posted_speed_limit = speed_df.posted_speed_limit.astype(int)

In [80]:
speed_df = speed_df.groupby('intersection').agg({'posted_speed_limit':'max'}).reset_index()  # takes the max posted limit per intersection (scipy's mode was the alternative)

In [81]:
speed_df.posted_speed_limit.unique()
speed_df.head()

Unnamed: 0,intersection,posted_speed_limit
0,111TH AND HALSTED,35
1,115TH AND HALSTED,35
2,119TH AND HALSTED,35
3,31ST AND CALIFORNIA,35
4,31ST ST AND MARTIN LUTHER KING DRIVE,35


In [85]:

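# speed_df is defined globally
def speed_lookup(intersection):
    '''returns the posted speed limit for the intersection, defaulting to 30'''
    # needed a try because one intersection had no crashes over the time period,
    # so the filtered array is empty and .values[0] raises IndexError
    try:
        speed = speed_df[speed_df['intersection']==intersection]['posted_speed_limit'].values[0]
    except IndexError:
        speed = 30
    return speed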
    
int_char_df['speed'] = int_char_df['intersection'].apply(speed_lookup)
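If the mode is preferred over the maximum, a sketch along these lines might work. The `crashes` and `chars` frames and the intersection names are hypothetical, and the 30 default mirrors the fallback in speed_lookup above.

```python
import pandas as pd

# Hypothetical stand-ins for signal_crashes and intersection_chars
crashes = pd.DataFrame({
    "intersection": ["A AND B", "A AND B", "A AND B", "C AND D"],
    "posted_speed_limit": [30, 30, 35, 25],
})
chars = pd.DataFrame({"intersection": ["A AND B", "C AND D", "E AND F"]})

# Most common posted limit per intersection; Series.mode() returns sorted
# values, so ties resolve to the lowest limit
mode_speed = (
    crashes.groupby("intersection")["posted_speed_limit"]
    .agg(lambda s: s.mode().iloc[0])
)

# map() leaves NaN for intersections with no crashes; fill with the default of 30
chars["speed"] = chars["intersection"].map(mode_speed).fillna(30).astype(int)
print(chars)
```

Using map with fillna avoids the try/except and makes the default for intersections without crashes explicit.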

In [86]:
int_char_df.head()

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic,speed
0,2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED,43100,35
1,4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED,42500,35
2,4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED,41800,35
3,2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA,41100,35
4,2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE,36500,35


In [92]:
make_table(int_char_df, 'intersection_chars', c, conn)

[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('all_hours',), ('int_startend',), ('intersection_chars',)]
