# Building a SQL database
This notebook builds the necessary db files for the project using SQLite3.

Most data is taken from the Chicago Data Portal (https://data.cityofchicago.org/) using the sodapy client for the Socrata API.
The API endpoints for the data are:
- Red Light Violations: https://data.cityofchicago.org/resource/spqx-js37.json
- Congestion by Region 2018-Present: https://data.cityofchicago.org/resource/kf7e-cur8.json
- Congestion by Region 2013-2018: https://data.cityofchicago.org/resource/emtn-qqdi.json
- Traffic Crashes: https://data.cityofchicago.org/resource/85ca-t3if.json

Weather data is taken from https://openweathermap.org/weather-data and is saved as a CSV file in the data folder.

Tables to build:
- daily_violations (one entry for each camera each day with total violations)
- intersection_locations (one entry for each intersection with lat/long)
- intersection_cams (one entry for each intersection with camera_ids)
- signal_crashes (one entry for each intersection crash with many columns)
- cam_locations (one entry for each cam, with lat/long)
- cam_startend (one entry for each cam with start end dates for min/max dates active)
- hourly_congestion (one entry per hour with bus speed averages for each region)
- hourly_weather (one entry per hour with many weather cols)
- region_data (one entry per region with locations and descriptions to place intersections)


## Connecting and Setup

### Required Imports

In [20]:
import pandas as pd
from sodapy import Socrata
#import matplotlib.pyplot as plt
from datetime import datetime
from modules.myfuncs import *
import warnings
import numpy as np
from geopy.geocoders import Nominatim
from scipy.stats import mode, percentileofscore

warnings.filterwarnings('ignore')

### Create/connect to db and build the TABLEs
We create the connection and cursor objects we will use to communicate with our SQLite database.

In [21]:
# Create a db file or open connection
conn = create_connection('database/rlc.db')  # function I created in myfuncs file
c = conn.cursor()
#conn.close()

sqlite3 version: 2.6.0
connected to database/rlc.db
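
`create_connection` lives in `modules/myfuncs.py`, which is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply wraps `sqlite3.connect` and prints similar status lines (the body below is an assumption, not the project's actual code):

```python
import sqlite3

def create_connection(db_file):
    """Open (or create) a SQLite database file and return the connection.

    Hypothetical stand-in for the helper in modules/myfuncs.py.
    """
    try:
        conn = sqlite3.connect(db_file)
    except sqlite3.Error as err:
        print(f'could not connect to {db_file}: {err}')
        return None
    print('SQLite library version:', sqlite3.sqlite_version)
    print('connected to', db_file)
    return conn

# ':memory:' gives a throwaway database, handy for trying the helper out
demo_conn = create_connection(':memory:')
demo_cursor = demo_conn.cursor()
```

The cursor returned by `conn.cursor()` is what later cells pass around as `c`.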


### Set up the Socrata client
The Chicago Data Portal, which holds most of the data used in this project, runs on Socrata and can be queried from Python with the sodapy library.
Here we create a client, which we will use to query the data at the portal.

The data API is at [data.cityofchicago.org](https://data.cityofchicago.org)
The individual dataset endpoints are found by browsing the site.

In [3]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:

url = "data.cityofchicago.org"
client = Socrata(url, None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofchicago.org",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")



# Build the TABLEs

For every TABLE
- Use a Socrata client query to get all relevant data
- Preprocess data as needed
- Create Table

Our data
- rlc_cam is up to 1M red light camera violation records from 2015 to 2020
- crash_data is up to 1M crash records from 2015 to 2020
- traffic_data is up to 10M congestion records from 2015 to 2020
- wx_data is a CSV file from [openweathermap.org](https://openweathermap.org), stored in the data folder

## 1) Build intersection_chars TABLE from int_df

This data is created by me.  It is a dictionary contained in this repository under the file 'int_chars.py'.

To create this file, I went through all 180+ intersections with red light cameras and cross-referenced each one with the map at https://data.cityofchicago.org/Transportation/Average-Daily-Traffic-Counts-Map/pf56-35rv and with Google Maps to compile the following data.

- roads (list): road segments as identified in the average-daily-traffic-counts dataset linked above.  Used to determine volume of traffic.
- protected_turn (int): How many of the left turns are protected (left turn arrow).
- total_lanes (int): Count of total lanes.  If a road has one lane in each of the N/E/S/W-bound directions, that counts as 4.  Ranges from 3 to 14 lanes.
- medians (int): Count of physical median barriers that extend up to the intersection.
- exit (int): 0 if no exit.  1 if an exit on/off ramp is within 100 m of the center of the intersection.  Traffic flow is affected by proximity to an exit.
- split (int): 1 if it is a divided boulevard (common in Chicago) where the divided lanes are split by traffic signals in the median. Look at examples on Google Maps.
- way (int): number of directions of traffic flow. A 4-way intersection (N/E/S/W) has a value of 4.
- underpass (int): number of ways that have an underpass extending up to the intersection.  These are notoriously bad intersections in Chicago.
- no_left (int): number of no left turn signs.  Usually found where smaller streets meet larger roads or at high-volume intersections.
- angled (int): 1 if the angle between two 2-way roads is greater than 30 degrees (measured with the 1/2/sqrt(3) rule, the side ratios of a 30-60-90 triangle).
- triangle (int): 1 if three 2-way roads intersect at one point or form a triangle with all 3 road segments under 50 m.
- one_way (int): number of 1-way directions.
- turn_lanes (int): how many directions have physical and identified turn lanes for left-hand turns.
- lat (float): latitude of the center of the intersection.
- long (float): longitude of the center of the intersection.
- rlc (int): 1 if a red light camera is present
- intersection (str): name of intersection as defined in signal_crashes table in db
- daily_traffic (int): volume of daily traffic through intersection.  Sum of incoming roads from roads list.
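
To make the shape of the dictionary concrete, here is a sketch of what one entry could look like. The values are copied from the `111TH AND HALSTED` row of the `int_df.head()` output further down; the exact layout of `int_chars.py` is an assumption, and `int_chars_example` is a made-up name. `daily_traffic` is not part of the entry because it is computed later from `roads`.

```python
# Hypothetical excerpt of the int_chars dictionary: one key per intersection,
# each holding a dict with the fields described above.
int_chars_example = {
    '111TH AND HALSTED': {
        'roads': ['28 West', '11600 South'],
        'protected_turn': 2,
        'total_lanes': 6,
        'medians': 2,
        'exit': 0,
        'split': 0,
        'way': 4,
        'underpass': 0,
        'no_left': 0,
        'angled': 1,
        'triangle': 0,
        'one_way': 0,
        'turn_lanes': 2,
        'lat': 41.692362,
        'long': -87.642423,
        'rlc': 1,
    },
}

for name, chars in int_chars_example.items():
    print(name, '->', len(chars), 'fields')
```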

### Preprocess int_df
Import the dictionary and convert to DataFrame which will be written as Table in my db

In [4]:
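from modules.int_chars import *  # provides the int_chars dictionary
import pandas as pd

# one row per intersection; the dictionary keys become the index
int_df = pd.DataFrame.from_dict(int_chars, orient='index')
int_df['intersection'] = int_df.index  # keep the name as a regular column too
int_df.isna().sum()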

roads             0
protected_turn    0
total_lanes       0
medians           0
exit              0
split             0
way               0
underpass         0
no_left           0
angled            0
triangle          0
one_way           0
turn_lanes        0
lat               0
long              0
rlc               0
intersection      0
dtype: int64

For now, we will only use intersections with rlc of 1.  I may later add intersections without rlc to identify crash characteristics.

In [5]:
int_df = int_df[int_df['rlc']==1].copy()  # camera intersections only; copy so later column assignments apply to this frame
int_df.columns

Index(['roads', 'protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'lat', 'long', 'rlc', 'intersection'],
      dtype='object')

The columns were read in as non-null objects, so we cast them to numeric types before creating the table.

In [6]:
cols_toint = ['protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'rlc']
cols_tofloat = ['lat', 'long',]

int_df[cols_toint] = int_df[cols_toint].astype(int)
int_df[cols_tofloat] = int_df[cols_tofloat].astype(float)
int_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 182 entries, 111TH AND HALSTED to WESTERN AND TOUHY
Data columns (total 17 columns):
roads             182 non-null object
protected_turn    182 non-null int64
total_lanes       182 non-null int64
medians           182 non-null int64
exit              182 non-null int64
split             182 non-null int64
way               182 non-null int64
underpass         182 non-null int64
no_left           182 non-null int64
angled            182 non-null int64
triangle          182 non-null int64
one_way           182 non-null int64
turn_lanes        182 non-null int64
lat               182 non-null float64
long              182 non-null float64
rlc               182 non-null int64
intersection      182 non-null object
dtypes: float64(2), int64(13), object(2)
memory usage: 25.6+ KB


In [7]:
int_df.head()  # verify my table data before commit

Unnamed: 0,roads,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection
111TH AND HALSTED,"[28 West, 11600 South]",2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED
115TH AND HALSTED,"[714 West, 11600 South]",4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED
119TH AND HALSTED,"[446 West, 11600 South]",4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED
31ST AND CALIFORNIA,"[2825 West, 3026 South]",2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA
31ST ST AND MARTIN LUTHER KING DRIVE,"[440 East, 3030 South]",2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE


### Add daily traffic volume
Now bring in the traffic counts using the roads list associated with each intersection.
This data is also from the data portal.  A survey was done in 2013 for two weeks, and traffic flow was recorded at many points around the city.  We assume that the traffic patterns at that time are at least somewhat consistent with more current patterns, so the counts can be used as an estimate of how busy each intersection is on a typical day.

In [8]:
daily_traffic = client.get("pfsx-4n4m", 
                     limit=2000,
                    )

daily_traffic = pd.DataFrame.from_records(daily_traffic) # Convert to pandas DataFrame

In [9]:
daily_traffic.info()
cols_tokeep = ['traffic_volume_count_location_address', 'total_passing_vehicle_volume',]
daily_traffic = daily_traffic[cols_tokeep]

daily_traffic.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1279 entries, 0 to 1278
Data columns (total 15 columns):
id                                             1279 non-null object
traffic_volume_count_location_address          1279 non-null object
street                                         1279 non-null object
date_of_count                                  1279 non-null object
total_passing_vehicle_volume                   1279 non-null object
vehicle_volume_by_each_direction_of_traffic    1279 non-null object
latitude                                       1279 non-null object
longitude                                      1279 non-null object
location                                       1279 non-null object
:@computed_region_rpca_8um6                    1266 non-null object
:@computed_region_vrxf_vc4k                    1266 non-null object
:@computed_region_6mkv_f3dw                    1279 non-null object
:@computed_region_bdys_3d7i                    1265 non-null object
:@compute

Unnamed: 0,traffic_volume_count_location_address,total_passing_vehicle_volume
0,5838 West,7100
1,320 East,8600
2,1730 East,53500
3,125 East,700
4,2924 East,4200


In [10]:
daily_traffic.total_passing_vehicle_volume = daily_traffic.total_passing_vehicle_volume.astype(int)

Combine my characteristics with my daily_traffic by looking up traffic volume from daily_traffic df.

In [11]:
def look_up_roads(road_list):
    '''
    Look up the traffic count of each road segment and return the total.
            Parameters:
                road_list (list): road segment list for intersection
            Returns:
                total (int): combined traffic volume of every road in road_list.
    '''
    total = 0
    for road in road_list:
        # first matching row in daily_traffic; raises IndexError if the segment is not found
        count = daily_traffic[daily_traffic['traffic_volume_count_location_address']==road]['total_passing_vehicle_volume'].values[0]
        total += count
    return total

int_df['daily_traffic'] = int_df['roads'].apply(look_up_roads)
int_df.drop(columns=['roads'], inplace=True)
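
`look_up_roads` filters the whole frame once per road segment and fails with a bare `IndexError` when a segment has no count. A dictionary-based variant is one possible alternative; the sketch below uses two addresses from the `daily_traffic` output above, paired only for illustration (they are not a real intersection). `volume_by_address` and `total_volume` are hypothetical names.

```python
# address -> vehicle count, using two rows shown in the daily_traffic output
volume_by_address = {'5838 West': 7100, '320 East': 8600}

def total_volume(road_list, volumes):
    """Sum the counts for every road segment, failing loudly on unknown ones."""
    missing = [road for road in road_list if road not in volumes]
    if missing:
        raise KeyError(f'no traffic count for: {missing}')
    return sum(volumes[road] for road in road_list)

print(total_volume(['5838 West', '320 East'], volume_by_address))  # 15700
```

With the real frame, the dictionary could be built once with `dict(zip(daily_traffic['traffic_volume_count_location_address'], daily_traffic['total_passing_vehicle_volume']))`.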

In [12]:
int_df.head()

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
111TH AND HALSTED,2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED,43100
115TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED,42500
119TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED,41800
31ST AND CALIFORNIA,2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA,41100
31ST ST AND MARTIN LUTHER KING DRIVE,2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE,36500


### Create intersection_chars TABLE

In [13]:
make_table(int_df, 'intersection_chars', c, conn)  # function from import

[('daily_violations',), ('cam_locations',), ('cam_startend',), ('intersection_chars',)]
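
`make_table` is imported from `modules/myfuncs.py` and is not shown here. Judging by its arguments and the list of table names it prints, it might look roughly like the sketch below, which assumes it replaces the table through `DataFrame.to_sql` and then lists the tables in `sqlite_master`. This is a guess at the behavior, not the project's actual implementation.

```python
import sqlite3
import pandas as pd

def make_table(df, table_name, c, conn):
    """Write df to table_name (replacing it) and print the tables present.

    Hypothetical stand-in for the helper imported from modules/myfuncs.py.
    """
    df.to_sql(table_name, conn, if_exists='replace', index=False)
    conn.commit()
    c.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print(c.fetchall())

# Try it on an in-memory database with one row from int_df.head() above
demo_conn = sqlite3.connect(':memory:')
demo_df = pd.DataFrame({'intersection': ['111TH AND HALSTED'],
                        'daily_traffic': [43100]})
make_table(demo_df, 'intersection_chars', demo_conn.cursor(), demo_conn)
```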


## 2) Build daily_violations TABLE  from rlc_df

This table will hold the red light camera violations data.  It will include daily violations of every red light camera.
This is a large dataset.  (320 cameras with daily violations over the past 5 years)

### Query red light violations
Use the API endpoint to get data with Socrata query.

In [14]:
# Red light violations
# Takes several minutes to run and holds about 500mb in memory to build

# Up to 10,000,000 results (the limit below), returned as JSON from API / converted to Python list of dictionaries by sodapy
rlc_df = client.get("spqx-js37", #speed cams are at 'hhkd-xvj4' if you want to investigate?
                     #where='violation_date > 01-01-2020',
                     where='violation_date > \'2017-01-01T00:00:00.000\'',
                     limit=10000000,
                    )

rlc_df = pd.DataFrame.from_records(rlc_df) # Convert to pandas DataFrame
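
The query above asks for everything in one request with a very large `limit`. Paging through the results with `limit` and `offset` is one way to keep each request small. The sketch below shows the loop with an offline stand-in for the API so it runs without network access; `fetch_all`, `fake_page` and the synthetic `fake_rows` are made up for illustration.

```python
def fetch_all(fetch_page, page_size=50000):
    """Collect every record from a paged source.

    fetch_page(limit, offset) must return a list of records, and a
    short or empty list once the data runs out.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return records

# Synthetic rows standing in for API results (not real violation data)
fake_rows = [{'camera_id': str(1000 + i), 'violations': '1'} for i in range(25)]

def fake_page(limit, offset):
    return fake_rows[offset:offset + limit]

all_rows = fetch_all(fake_page, page_size=10)
print(len(all_rows))

# With sodapy, fetch_page could be something like (assumption, untested here):
# lambda limit, offset: client.get("spqx-js37", order=":id",
#                                  limit=limit, offset=offset)
```

A stable `order` matters when paging, otherwise rows can shift between requests.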

### Preprocess Red Light Camera Data

Data Columns of interest (from API docs):

INTERSECTION -
Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection. Plain Text

CAMERA ID -
A unique ID for each physical camera at an intersection, which may contain more than one camera. Plain Text

ADDRESS	-
The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera. Plain Text

VIOLATION DATE -
The date of when the violations occurred. NOTE: The citation may be issued on a different date. Date & Time

VIOLATIONS - 
Number of violations for each camera on a particular day. Number

LATITUDE -
The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84. Number

LONGITUDE -
The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84.
Number

#### Investigate violation data

In [15]:
rlc_df.info()
rlc_df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388763 entries, 0 to 388762
Data columns (total 10 columns):
intersection      388763 non-null object
camera_id         388762 non-null object
address           388763 non-null object
violation_date    388763 non-null object
violations        388763 non-null object
x_coordinate      368428 non-null object
y_coordinate      368428 non-null object
latitude          368428 non-null object
longitude         368428 non-null object
location          368428 non-null object
dtypes: object(10)
memory usage: 29.7+ MB


intersection          0
camera_id             1
address               0
violation_date        0
violations            0
x_coordinate      20335
y_coordinate      20335
latitude          20335
longitude         20335
location          20335
dtype: int64

#### Drop nan values and unnecessary columns
We see that every column is a text/non-null object.  We need to convert the types before manipulating the data for preprocessing.

There are a fair number of missing locations/lat/long values.  We hope to be able to replace those missing values.
They represent a large enough portion of the dataset that we should look them up.

The NA values for camera_id will have to be dropped, since we don't know which camera they belong to.

We will not be using x_coordinate and y_coordinate, so we drop those.  We will also drop location, since we already have lat/long in other columns.

In [16]:
try:
    # put this in a try so that running the cell twice skips the drop instead of failing
    rlc_df.dropna(subset=['camera_id'], inplace=True)

    # drop xy coord and location columns; index=1 additionally drops the row labelled 1
    rlc_df = rlc_df.drop(columns=['x_coordinate', 'y_coordinate', 'location'], index=1)
except KeyError:
    # columns (or row) already removed on an earlier run
    pass



rlc_df.isna().sum()

intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          20335
longitude         20335
dtype: int64

In [17]:
rlc_df.intersection.sort_values().unique()

array(['111TH AND HALSTED', '115TH AND HALSTED', '119TH AND HALSTED',
       '31ST ST AND MARTIN LUTHER KING DRIVE', '35TH AND WESTERN',
       '4700 WESTERN', '55TH AND KEDZIE', '55TH AND WESTERN',
       '55TH and PULASKI', '63RD AND STATE', '71ST AND ASHLAND',
       '75TH AND STATE', '79TH AND HALSTED', '79TH AND KEDZIE',
       '87TH AND VINCENNES', '95TH AND STONEY ISLAND', '99TH AND HALSTED',
       'ADDISON AND HARLEM', 'ARCHER AND CICERO', 'ASHLAND AND 87TH',
       'ASHLAND AND 95TH', 'ASHLAND AND DIVISION',
       'ASHLAND AND FULLERTON', 'ASHLAND AND IRVING PARK',
       'ASHLAND AND LAWRENCE', 'ASHLAND AND MADISON',
       'AUSTIN AND ADDISON', 'AUSTIN AND IRVING PARK',
       'BELMONT AND KEDZIE', 'BROADWAY/SHERIDAN AND DEVON',
       'CALIFORNIA AND DEVON', 'CALIFORNIA AND DIVERSEY',
       'CALIFORNIA AND PETERSON', 'CANAL AND ROOSEVELT',
       'CENTRAL AND ADDISON', 'CENTRAL AND BELMONT',
       'CENTRAL AND CHICAGO', 'CENTRAL AND DIVERSEY',
       'CENTRAL AND FULLER

#### Fix intersection errors identified in EDA

In [18]:
# fix a specific intersection naming problem
# both 'NORTHWEST HIGHWAY AND FOSTER' and 'FOSTER AND NORTHWEST HIGHWAY' exist.  Only one should be used.  They are same.
# Also happens with "MILWAUKEE AND CENTRAL" / "CENTRAL AND MILWAUKEE"
rlc_df['intersection'] = rlc_df['intersection'].apply(lambda x: 'MILWAUKEE AND CENTRAL' if x=='CENTRAL AND MILWAUKEE' else x)
rlc_df['intersection'] = rlc_df['intersection'].apply(lambda x: 'FOSTER AND NORTHWEST HIGHWAY' if x=='NORTHWEST HIGHWAY AND FOSTER' else x)

#### Manipulate datatypes for preprocessing. 

In [19]:
rlc_df['violations'] = rlc_df['violations'].astype(int)
rlc_df['latitude'] = rlc_df['latitude'].astype(float)
rlc_df['longitude'] = rlc_df['longitude'].astype(float)
rlc_df['violation_date'] = pd.to_datetime(rlc_df['violation_date'])
rlc_df['month'] = rlc_df['violation_date'].apply(lambda x: int(x.month))
rlc_df['day'] = rlc_df['violation_date'].apply(lambda x: int(x.day))  # fixed from dat to day!

rlc_df['weekday'] = rlc_df['violation_date'].apply(lambda x: int(datetime.weekday(x)))
rlc_df['year'] = rlc_df['violation_date'].apply(lambda x: int(x.year))
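
The row-wise `apply` calls above work, but on a frame of several hundred thousand rows the vectorized `.dt` accessor is usually faster and gives the same integers. A small sketch on two dates that appear in this notebook's outputs (`demo` is a throwaway frame, not `rlc_df`):

```python
import pandas as pd

demo = pd.DataFrame({'violation_date': pd.to_datetime(['2019-06-05', '2017-11-30'])})

dates = demo['violation_date'].dt
demo['month'] = dates.month
demo['day'] = dates.day
demo['weekday'] = dates.weekday   # Monday=0 ... Sunday=6, same as datetime.weekday
demo['year'] = dates.year
print(demo)
```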

### Create daily_violations TABLE 

In [20]:
make_table(rlc_df, 'daily_violations', c, conn)

[('cam_locations',), ('cam_startend',), ('intersection_chars',), ('daily_violations',)]


## 3) Build cam_locations and cam_startend TABLEs
Build the cam_locations TABLE from cam_locs and the cam_startend TABLE from cam_startend.
We will bring in camera locations from the data portal and identify the start and end date for each camera.  We will use the start and end dates to develop a natural experiment in which we identify which cameras were turned on or off during our study timeframe.

### Camera location preprocessing
Make a df with info for each camera
Will contain the following:
- camera_id
- location
- start date (when was the camera turned on)
- end date (when was the camera turned off)

In [21]:
cam_df = rlc_df.copy()
cam_df['start'] = cam_df['camera_id'].apply(lambda x: None)
cam_df['end'] = cam_df['camera_id'].apply(lambda x: None)

In [22]:
cam_start = cam_df.groupby(['camera_id'])['violation_date'].min().reset_index()
cam_end = cam_df.groupby(['camera_id'])['violation_date'].max().reset_index()

cam_startend = cam_start.copy()

#print(cam_end[cam_end['camera_id']=='1503'].values[0][1])  # for testing output
cam_startend['end'] = cam_start['camera_id'].apply(lambda x: cam_end[cam_end['camera_id']==x].values[0][1])

cam_startend.rename(columns={"violation_date": "start"}, inplace=True)
                                                   
print('NA values in cam_startend:', cam_startend.isna().sum(), end='\n\n', sep='\n')

print('Describe cam_startend:', cam_startend.describe(), end='\n\n', sep='\n')



NA values in cam_startend:
camera_id    0
start        0
end          0
dtype: int64

Describe cam_startend:
       camera_id                start                  end
count        316                  316                  316
unique       316                   16                   17
top         1994  2017-01-02 00:00:00  2021-02-15 00:00:00
freq           1                  253                  196
first        NaN  2017-01-02 00:00:00  2017-05-29 00:00:00
last         NaN  2018-03-05 00:00:00  2021-02-15 00:00:00
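
The start/end frame above is assembled from two separate `groupby` results plus a per-row lookup. The same table can be produced in one pass by aggregating `min` and `max` together. A sketch on invented rows (the camera ids `A` and `B` are placeholders, not real cameras):

```python
import pandas as pd

# Invented rows for illustration only
demo = pd.DataFrame({
    'camera_id': ['A', 'A', 'B'],
    'violation_date': pd.to_datetime(['2017-01-02', '2021-02-15', '2018-03-05']),
})

cam_startend_demo = (
    demo.groupby('camera_id')['violation_date']
        .agg(['min', 'max'])                            # first and last date per camera
        .rename(columns={'min': 'start', 'max': 'end'})
        .reset_index()
)
print(cam_startend_demo)
```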



#### Make a df that has camera locations and intersections
Intersections are present (and addresses), but we do not have lat/long info for all cams

In [23]:
# we had some incorrect data in the code below, but have a creative fix.
cam_locs = rlc_df.groupby(['camera_id', 'intersection']).max().reset_index()
cam_locs.head()

# the two lengths differ by one, so one camera_id appears twice
len(cam_locs)  # 317 rows
len(cam_locs['camera_id'].unique()) # 316 unique cameras

cam_locs[cam_locs['camera_id'].duplicated()]  # 1421 is dupe
print('Two of them\n', cam_locs[cam_locs['camera_id'] == '1421'])  # we see two of them
print()

# Which one is it?
print('Damen/Diversey', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='DAMEN AND DIVERSEY')]['camera_id'].count())
print('Laramie/Fullerton:', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='LARAMIE AND FULLERTON')]['camera_id'].count())

# Turns out that one camera is listed at two locations. One of them was only recorded once, so we drop it.
cam_locs = cam_locs[(cam_locs['camera_id']!='1421') | (cam_locs['intersection']!='DAMEN AND DIVERSEY')]
print("Total cams", len(cam_locs))  # 316 total (got rid of the bad one)

Two of them
    camera_id           intersection                  address violation_date  \
75      1421     DAMEN AND DIVERSEY  2000 W DIVERSEY PARKWAY     2017-11-30   
76      1421  LARAMIE AND FULLERTON    2400 N LARAMIE AVENUE     2021-02-13   

    violations   latitude  longitude  month  day  weekday  year  
75           1  41.932394 -87.678173     11   30        3  2017  
76           6  41.924152 -87.756295     12   31        6  2021  

Damen/Diversey 1
Laramie/Fullerton: 853
Total cams 316


In [24]:
cam_locs.isna().sum()  # missing location for 17 cameras.  Let's fix it

camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          17
longitude         17
month              0
day                0
weekday            0
year               0
dtype: int64

Looks like we are also missing 17 of the 316 cam locations.  Let's look them up!

We actually changed this approach.  We went to the maps to get intersection locations instead, but will leave the old code in case we ever need the exact location of cameras.

In [25]:
cam_locs.info()
cam_locs.isna().sum() # lat/long still missing for 17 cameras; replaced with intersection centers further down

<class 'pandas.core.frame.DataFrame'>
Int64Index: 316 entries, 0 to 316
Data columns (total 11 columns):
camera_id         316 non-null object
intersection      316 non-null object
address           316 non-null object
violation_date    316 non-null datetime64[ns]
violations        316 non-null int64
latitude          299 non-null float64
longitude         299 non-null float64
month             316 non-null int64
day               316 non-null int64
weekday           316 non-null int64
year              316 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(3)
memory usage: 29.6+ KB


camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          17
longitude         17
month              0
day                0
weekday            0
year               0
dtype: int64

### Create cam_locations TABLE
Write cam_locs to the cam_locations TABLE and cam_startend to the cam_startend TABLE.

In [26]:
make_table(cam_locs, 'cam_locations', c, conn)
make_table(cam_startend, 'cam_startend', c, conn)

[('cam_startend',), ('intersection_chars',), ('daily_violations',), ('cam_locations',)]
[('intersection_chars',), ('daily_violations',), ('cam_locations',), ('cam_startend',)]


###  Fix lat/long data
We still have missing lat/long info in rlc_df.
Now that we have cam_locs, we can fix rlc_df before moving on.

In [27]:
rlc_df.isna().sum()  


intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          20335
longitude         20335
month                 0
day                   0
weekday               0
year                  0
dtype: int64

### Change cam position to intersection lat/lon
During EDA, we found that five cameras had completely wrong lat/long locations.
Several others were located a little too far from the intersection to work properly.  When we rebuild the db, we will use a larger distance threshold than 30 m.

I intend to use intersection locations in lieu of camera locations.  I hope this makes all of my position data consistent for gathering crash info and eliminates Chicago Data Portal errors.

The camera position is sometimes up to 35 m up the road from the intersection.  Using it would cause us to misidentify crashes from other intersections or miss some in the intersection of interest.

Remedy: Use the center point of the intersection for all cams.  I have recorded this in intersection_chars (looked up on Google Maps for all 180+ intersections).
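
To get a feel for how far a camera position sits from its intersection center, the great-circle (haversine) distance between the two points can be computed directly. The sketch below uses the camera and intersection coordinates for `SACRAMENTO AND CHICAGO` that appear in the `rlc_df.head()` output later in this notebook; `haversine_m` is a hypothetical helper, not something from `myfuncs`.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    earth_radius_m = 6371000  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

camera = (41.895705, -87.702219)                   # position reported by the portal
center = (41.89559271274954, -87.70223070169483)   # intersection center from int_chars

offset_m = haversine_m(*camera, *center)
print(round(offset_m, 1), 'm between camera and intersection center')
```

For this camera the offset is on the order of ten metres, which is small next to the distances discussed above.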

In [28]:
int_df.columns


Index(['protected_turn', 'total_lanes', 'medians', 'exit', 'split', 'way',
       'underpass', 'no_left', 'angled', 'triangle', 'one_way', 'turn_lanes',
       'lat', 'long', 'rlc', 'intersection', 'daily_traffic'],
      dtype='object')

In [29]:
def location_correction(int_df, intersect, latlong):
    # lookup function from intersection df to get the lat long
    # int_df is the intersection characteristic frame from 1) above
    # intersect is the intersection name used to link tables/df
    # latlong is either 'lat' or 'long'
    if latlong == 'lat':
        lat = int_df[int_df['intersection']==intersect]['lat'].values[0]
        if pd.isna(lat): print(lat, intersect)  # report intersections without a latitude
        return lat
    else:
        long = int_df[int_df['intersection']==intersect]['long'].values[0]
        return long

cam_locs['latitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'lat'))
cam_locs['longitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'long'))

In [30]:
rlc_df.intersection.head()

0                 4700 WESTERN
2           CICERO AND ADDISON
3    LAKE SHORE DR AND BELMONT
4       SACRAMENTO AND CHICAGO
5         PETERSON AND WESTERN
Name: intersection, dtype: object

In [31]:
int_df[int_df['intersection']=='IRVING PARK AND KILPATRICK']

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
IRVING PARK AND KILPATRICK,1,6,0,0,0,4,0,0,0,0,1,3,41.953395,-87.744635,1,IRVING PARK AND KILPATRICK,37100


Make my cameras have a location that is center of intersection instead of exact cam location.

In [32]:
#⏳⏳⏳⏳⏳  (8min on my macbook pro)
def read_loc(int_df, intersection):
    # This function looks up the new camera/intersection location using the intersection name as key
    cam = int_df[int_df['intersection']==intersection]
    # take the first (only) matching row explicitly before converting to float
    return (float(cam['lat'].iloc[0]), float(cam['long'].iloc[0]))
        


# create a location column so we only have to do it once
rlc_df['location'] = rlc_df['intersection'].apply(lambda x: read_loc(int_df, x))
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year,location
0,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808378,-87.684571,6,5,2,2019,"(41.808442084381, -87.68418270817706)"
2,CICERO AND ADDISON,1612,3600 N CICERO AVENUE,2019-06-05,7,41.946164,-87.747215,6,5,2,2019,"(41.946123417859745, -87.74705265633155)"
3,LAKE SHORE DR AND BELMONT,1413,400 W BELMONT AVE,2019-06-05,75,41.940241,-87.639639,6,5,2,2019,"(41.94046398185605, -87.6383448872575)"
4,SACRAMENTO AND CHICAGO,1814,3000 W CHICAGO AVENUE,2019-06-05,8,41.895705,-87.702219,6,5,2,2019,"(41.89559271274954, -87.70223070169483)"
5,PETERSON AND WESTERN,1014,2400 W PETERSON,2019-06-05,6,41.990609,-87.689735,6,5,2,2019,"(41.99053050329496, -87.68961714584131)"


In [33]:
# then add in the new lat longs to the df
rlc_df['latitude'] = rlc_df['location'].apply(lambda x: x[0])
rlc_df['longitude'] = rlc_df['location'].apply(lambda x: x[1])
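
The row-by-row lookup above filters `int_df` once for every violation record, which is why it takes minutes. Mapping the intersection name through an indexed Series does the same replacement in a single vectorized step. A sketch using two intersection centers from the output above (`centers` and `violations` are throwaway demo frames):

```python
import pandas as pd

# Intersection centers, indexed by name (values from the output above)
centers = pd.DataFrame({
    'intersection': ['4700 WESTERN', 'CICERO AND ADDISON'],
    'lat': [41.808442084381, 41.946123417859745],
    'long': [-87.68418270817706, -87.74705265633155],
}).set_index('intersection')

violations = pd.DataFrame({
    'intersection': ['CICERO AND ADDISON', '4700 WESTERN', 'CICERO AND ADDISON'],
})

# Series.map looks each name up in the index of the Series passed in
violations['latitude'] = violations['intersection'].map(centers['lat'])
violations['longitude'] = violations['intersection'].map(centers['long'])
print(violations)
```

Names that are missing from `centers` come back as NaN, so the `isna()` check in the next cell still works as a sanity test.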

In [34]:
rlc_df[rlc_df.latitude.isna()]['intersection'].unique()  # which intersections am I still missing

array([], dtype=object)

In [35]:
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year,location
0,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808442,-87.684183,6,5,2,2019,"(41.808442084381, -87.68418270817706)"
2,CICERO AND ADDISON,1612,3600 N CICERO AVENUE,2019-06-05,7,41.946123,-87.747053,6,5,2,2019,"(41.946123417859745, -87.74705265633155)"
3,LAKE SHORE DR AND BELMONT,1413,400 W BELMONT AVE,2019-06-05,75,41.940464,-87.638345,6,5,2,2019,"(41.94046398185605, -87.6383448872575)"
4,SACRAMENTO AND CHICAGO,1814,3000 W CHICAGO AVENUE,2019-06-05,8,41.895593,-87.702231,6,5,2,2019,"(41.89559271274954, -87.70223070169483)"
5,PETERSON AND WESTERN,1014,2400 W PETERSON,2019-06-05,6,41.990531,-87.689617,6,5,2,2019,"(41.99053050329496, -87.68961714584131)"


In [36]:
# get rid of location column.  We have latitude and longitude in separate columns and don't want BLOBs for now
if 'location' in rlc_df.columns:
    rlc_df.drop(columns=['location'], inplace=True)

### Create daily_violations TABLE
Now we have gone back to rlc_df and fixed the location data to line up with the intersections.
We commit it to our database, replacing the earlier version of the table.

In [37]:
make_table(rlc_df, 'daily_violations', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',)]


## 4) Build intersection_cams TABLE
Create a table to store intersection and camera data using int_cams df.

### Preprocess intersection_cams
We will now focus on linking the rlc intersections to our crashes.
We find that we have 316 cameras at 157 intersections.

Group the data by intersection and pull out the individual camera_ids, placing them into separate columns.  This allows us to store the individual camera ids associated with each intersection should we need them for a query.

In [38]:
int_cams = cam_locs.groupby(['intersection']) \
                    .agg({'latitude':pd.Series.max, 'longitude':pd.Series.max,}) \
                    .reset_index()

int_cams['cam1'] = int_cams['intersection'] \
                            .apply(lambda x: cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[0])

int_cams['cam2'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])==1 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[1])

int_cams['cam3'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])<3 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[2])                             

int_cams.head()

Unnamed: 0,intersection,latitude,longitude,cam1,cam2,cam3
0,111TH AND HALSTED,41.692362,-87.642423,2422,2424,
1,115TH AND HALSTED,41.685089,-87.642094,2552,2553,
2,119TH AND HALSTED,41.677774,-87.64193,2402,2404,
3,31ST ST AND MARTIN LUTHER KING DRIVE,41.838441,-87.617338,2121,2123,
4,35TH AND WESTERN,41.830281,-87.684775,2091,2092,
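
The three `apply` calls above hard-code room for up to three cameras per intersection. Numbering the cameras within each intersection and pivoting is one possible alternative that produces as many `camN` columns as the data needs. A sketch on the first two intersections shown above (`cams` and `wide` are demo names):

```python
import pandas as pd

cams = pd.DataFrame({
    'intersection': ['111TH AND HALSTED', '111TH AND HALSTED',
                     '115TH AND HALSTED', '115TH AND HALSTED'],
    'camera_id': ['2422', '2424', '2552', '2553'],
})

# cumcount numbers the rows within each intersection: 0, 1, 2, ...
cams['slot'] = 'cam' + (cams.groupby('intersection').cumcount() + 1).astype(str)

# one row per intersection, one column per camera slot
wide = cams.pivot(index='intersection', columns='slot', values='camera_id').reset_index()
print(wide)
```

Intersections with fewer cameras than the maximum get NaN in the unused slots.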


In [39]:
print('Total Cameras', len(cam_locs))
print('Total Intersections', len(int_cams))

Total Cameras 316
Total Intersections 157


### Create intersection_cams TABLE 
Create it from int_cams

In [40]:
make_table(int_cams, 'intersection_cams', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',)]


## 5) Create signal_crashes and all_crashes TABLE
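Create all_crashes from crash_df.  It holds every crash in the city since 2016 that has a location, whether or not a signal was involved.

Create signal_crashes from crash_df after filtering.  It holds only the signal-controlled, intersection-related crashes, each labeled with a red light cam intersection where one matches.

The crash data has one column for intersection-related crashes ('intersection_related_i') and another for the control device ('traffic_control_device').  We use both to filter our data.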


### Crash data preprocessing


In [41]:
# Crash Data
crash_data = client.get("85ca-t3if", 
                     where="crash_date > \'2016-01-01T00:00:00.000\'",
                     limit=1000000,
                    )

crash_df = pd.DataFrame.from_records(crash_data) # Convert to pandas DataFrame

In [42]:
# drop a few columns we don't need, including location (we have lat/long)
dropme = ['statements_taken_i', 'private_property_i', 'photos_taken_i', 'dooring_i', 'date_police_notified','location']

crash_df.drop(columns=dropme, inplace=True)

In [43]:
crash_df.isna().sum()

crash_record_id                       0
rd_no                              4100
crash_date                            0
posted_speed_limit                    0
traffic_control_device                0
device_condition                      0
weather_condition                     0
lighting_condition                    0
first_crash_type                      0
trafficway_type                       0
alignment                             0
roadway_surface_cond                  0
road_defect                           0
report_type                       11699
crash_type                            0
damage                                0
prim_contributory_cause               0
sec_contributory_cause                0
street_no                             0
street_direction                      3
street_name                           1
beat_of_occurrence                    5
num_units                             0
most_severe_injury                  961
injuries_total                      950


About 2.5k entries have no location, so we drop them.

In [44]:
crash_df.dropna(subset=['latitude',], inplace=True)  # get rid of na locations

### Contents of crash data

Let's look at what is in the data.  We have over 30 columns.  Much of it is categorical.  We want to look and see what is in here before we continue.  

We start with over half a million crashes.

In [45]:
# What's in this data?
col_interest = ['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'hit_and_run_i', 'damage', 'prim_contributory_cause',
       'sec_contributory_cause', 'street_no', 'street_direction',
       'street_name', 'beat_of_occurrence', 'num_units', 'most_severe_injury', 
        'injuries_fatal', 'injuries_incapacitating',
       'injuries_non_incapacitating', 'injuries_reported_not_evident',
       'injuries_no_indication', 'injuries_unknown', 'crash_hour',
       'crash_day_of_week', 'crash_month', 'latitude', 'longitude', 'lane_cnt',
       'intersection_related_i', 'crash_date_est_i',
       'work_zone_i', 'work_zone_type',
       'workers_present_i']

for col in col_interest:
    print(col, crash_df[col].unique())

traffic_control_device ['NO CONTROLS' 'STOP SIGN/FLASHER' 'TRAFFIC SIGNAL' 'UNKNOWN'
 'OTHER REG. SIGN' 'LANE USE MARKING' 'DELINEATORS'
 'FLASHING CONTROL SIGNAL' 'POLICE/FLAGMAN' 'RAILROAD CROSSING GATE'
 'SCHOOL ZONE' 'OTHER RAILROAD CROSSING' 'RR CROSSING SIGN' 'NO PASSING'
 'BICYCLE CROSSING SIGN']
device_condition ['NO CONTROLS' 'FUNCTIONING PROPERLY' 'NOT FUNCTIONING' 'UNKNOWN' 'OTHER'
 'FUNCTIONING IMPROPERLY' 'WORN REFLECTIVE MATERIAL' 'MISSING']
weather_condition ['CLEAR' 'RAIN' 'UNKNOWN' 'SNOW' 'CLOUDY/OVERCAST' 'SLEET/HAIL'
 'FREEZING RAIN/DRIZZLE' 'FOG/SMOKE/HAZE' 'OTHER' 'BLOWING SNOW'
 'SEVERE CROSS WIND GATE' 'BLOWING SAND, SOIL, DIRT']
lighting_condition ['DAYLIGHT' 'DARKNESS' 'DARKNESS, LIGHTED ROAD' 'UNKNOWN' 'DAWN' 'DUSK']
first_crash_type ['TURNING' 'REAR END' 'PARKED MOTOR VEHICLE'
 'SIDESWIPE OPPOSITE DIRECTION' 'ANGLE' 'SIDESWIPE SAME DIRECTION'
 'OTHER OBJECT' 'HEAD ON' 'PEDESTRIAN' 'FIXED OBJECT' 'PEDALCYCLIST'
 'REAR TO FRONT' 'REAR TO SIDE' 'REAR TO REAR' 'A

work_zone_i [nan 'Y' 'N']
work_zone_type [nan 'MAINTENANCE' 'CONSTRUCTION' 'UTILITY' 'UNKNOWN']
workers_present_i [nan 'Y' 'N']


### Create all_crashes TABLE
We create this and will continue to process it to deliver a table with only crashes at signaled intersections.

In [46]:
make_table(crash_df, 'all_crashes', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',)]


### Filter for desired crashes (intersections with signal)
The contents of crash data section has helped us.  
We can filter 'traffic_control_device' == 'TRAFFIC SIGNAL'.  
We can filter 'intersection_related_i' == 'Y'

This will leave us with only crashes that occurred at/because of intersections, and with a signal at the intersection.

intersection_related_i: A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.

In [47]:
crash_df = crash_df[(crash_df['traffic_control_device']=='TRAFFIC SIGNAL') & \
                    (crash_df['intersection_related_i']=='Y')]

### Add intersection to my signal crashes

Look up each lat long for crash and get the corresponding intersection if we have it.

Looking up every crash with geodesic distance equations took forever, and a plain Pythagorean distance was also slow.  I finally settled on a simple box: a crash is assigned to an intersection when its location falls inside a square centered on the intersection (100m on a side, so within 50m of the intersection centerpoint along each axis).

In [48]:
#⏳⏳⏳⏳⏳
# A per-pair distance calculation takes too long to process, so we use a box test instead.
box_side = 100  # side length in meters, so a crash must be within 50m of the intersection on each axis
box_lat = box_side / 111070 / 2 # half-side in degrees; 111070 is meters per deg lat in Chicago
box_long = box_side / 83000 / 2 # half-side in degrees; 83000 is meters per deg long in Chicago

def box_check(lat, long, int_df):
    answer = (int_df[  (int_df['lat'] > (lat - box_lat)) & 
                      (int_df['lat'] < (lat + box_lat)) &
                      (int_df['long'] > (long - box_long)) &
                      (int_df['long'] < (long + box_long))
                     ])
    if answer.empty: return None
    return answer['intersection'].values[0]
    
# THIS SEEMS TO WORK WITH SPEED AND ELIMINATES MEMORY PROBLEM
crash_df['intersection'] = crash_df.apply(lambda x: box_check(float(x.latitude), 
                                                              float(x.longitude), 
                                                              int_df), axis=1)
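
For reference, the box test can also be done for all crashes at once with NumPy broadcasting instead of a row-wise `apply`. This is only a sketch under assumptions: `box_match`, `ints_demo`, and the sample coordinates are hypothetical, and the `lat`/`long`/`intersection` column names are taken from the `box_check` function above. The longitude constant is consistent with roughly 111320 * cos(41.88 deg), which is about 82,900 meters per degree.

```python
import numpy as np
import pandas as pd

BOX_SIDE_M = 100            # full side of the box, so +/- 50 m from the center
M_PER_DEG_LAT = 111070      # constant used in the cell above
M_PER_DEG_LONG = 83000      # roughly 111320 * cos(41.88 deg)
half_lat = BOX_SIDE_M / M_PER_DEG_LAT / 2
half_long = BOX_SIDE_M / M_PER_DEG_LONG / 2

def box_match(crash_lat, crash_long, ints):
    """Return the first matching intersection name per crash, or None."""
    # Broadcasting gives arrays of shape (n_crashes, n_intersections).
    d_lat = np.abs(crash_lat[:, None] - ints['lat'].to_numpy()[None, :])
    d_long = np.abs(crash_long[:, None] - ints['long'].to_numpy()[None, :])
    inside = (d_lat < half_lat) & (d_long < half_long)
    first = inside.argmax(axis=1)            # index of first True (0 if none)
    names = ints['intersection'].to_numpy()[first].astype(object)
    names[~inside.any(axis=1)] = None        # no box contains this crash
    return names

# Hypothetical intersections and crash locations, for illustration only.
ints_demo = pd.DataFrame({'intersection': ['X AND Y', 'P AND Q'],
                          'lat': [41.8800, 41.9000],
                          'long': [-87.6300, -87.6500]})
crash_lat = np.array([41.8801, 41.9500, 41.9003])
crash_long = np.array([-87.6301, -87.7000, -87.6502])
print(box_match(crash_lat, crash_long, ints_demo))
```

With about 61k signal crashes and fewer than 200 intersections, the intermediate boolean array stays small, so memory should not be a concern at this scale.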

Extract the integer date data from the crash_date column

In [49]:
# Parse the timestamp once, then pull the integer date parts with the .dt accessor
crash_df['crash_date'] = pd.to_datetime(crash_df['crash_date'])
crash_df['year'] = crash_df['crash_date'].dt.year
crash_df['month'] = crash_df['crash_date'].dt.month
crash_df['day'] = crash_df['crash_date'].dt.day
crash_df['hour'] = crash_df['crash_date'].dt.hour

In [50]:
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61100 entries, 5 to 472016
Data columns (total 48 columns):
crash_record_id                  61100 non-null object
rd_no                            60631 non-null object
crash_date                       61100 non-null datetime64[ns]
posted_speed_limit               61100 non-null object
traffic_control_device           61100 non-null object
device_condition                 61100 non-null object
weather_condition                61100 non-null object
lighting_condition               61100 non-null object
first_crash_type                 61100 non-null object
trafficway_type                  61100 non-null object
alignment                        61100 non-null object
roadway_surface_cond             61100 non-null object
road_defect                      61100 non-null object
report_type                      59239 non-null object
crash_type                       61100 non-null object
damage                           61100 non-null object
pr

### Create signal_crashes TABLE
Create from crash_df.  These roughly 61k crashes are labeled with intersection at this point; region_id is added in section 8, where the table is overwritten.  Both serve as foreign keys for SQL queries.

In [51]:
make_table(crash_df, 'signal_crashes', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',)]


## 6) Build hourly_congestion TABLE 

Chicago tracks periodic (irregular) bus speed data.  The data is aggregated so that we get an average bus_speed for every region at least once an hour while buses are running.  

Build from all_traffic DataFrame.

For this one, we have to combine two different datasets.  Chicago changed the way data was recorded in 2018.  The columns are similar, but the newer dataset collects more fields.

### First dataset
We can't write the table until both datasets are loaded.

In [52]:
# Congestion Data
traffic_df = client.get("emtn-qqdi", 
                     #where="TIME > \'2015-01-01T00:00:00.000\'",
                     where='TIME > \'2016-01-01T00:00:00.000\'',
                     limit=10000000,
                    )

traffic_df = pd.DataFrame.from_records(traffic_df) # Convert to pandas DataFrame

In [53]:
traffic_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df['time'] = pd.to_datetime(traffic_df['time'])
traffic_df['bus_count'] = traffic_df['bus_count'].astype(int)
traffic_df['num_reads'] = traffic_df['num_reads'].astype(int)
traffic_df['speed'] = traffic_df['speed'].astype(float)

### Second dataset

In [54]:
# Congestion data from later
traffic_df2 = client.get("kf7e-cur8", #2018 to present
                     select='time, region_id, speed, bus_count, num_reads',  # this set is huge, so we won't get all       
                     where="TIME < \'2021-01-01T00:00:00.000\'",
                     limit=10000000,
                    )

# Convert to pandas DataFrame
traffic_df2 = pd.DataFrame.from_records(traffic_df2)


In [55]:
#traffic2_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df2['time'] = pd.to_datetime(traffic_df2['time'])
traffic_df2['bus_count'] = traffic_df2['bus_count'].astype(int)
traffic_df2['num_reads'] = traffic_df2['num_reads'].astype(int)
traffic_df2['speed'] = traffic_df2['speed'].astype(float)

### Preprocess hourly_congestion
We have two separate traffic DataFrames.  The data before 2018 and the data from 2018 onward come from two different API endpoints.


In [56]:
traffic_df.head()
traffic_df2.head()
traffic_df2.info()
print()
traffic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3959185 entries, 0 to 3959184
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
speed        float64
bus_count    int64
num_reads    int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 151.0+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3507463 entries, 0 to 3507462
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
bus_count    int64
num_reads    int64
speed        float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 133.8+ MB


In [57]:
# Merge my two data sets for congestion by region
all_traffic = pd.merge(traffic_df, traffic_df2, how='outer')
print('traffic dfs merged')

traffic dfs merged


In [58]:
all_traffic['hour'] = all_traffic['time'].dt.hour
print('added hour column')

all_traffic['day'] = all_traffic.time.dt.day
print('added day column')

all_traffic['month'] = all_traffic.time.dt.month
print('added month column')

all_traffic['year'] = all_traffic.time.dt.year
print('added year column')

all_traffic['weekday'] = all_traffic.time.dt.weekday
print('added weekday column')

added hour column
added day column
added month column
added year column
added weekday column


In [59]:
print(len(all_traffic))  # lots of dupes 
all_traffic = all_traffic.groupby(['year', 'month', 'day', 'hour', 'region_id']).mean().reset_index()
all_traffic.info()

7256572
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1242940 entries, 0 to 1242939
Data columns (total 9 columns):
year         1242940 non-null int64
month        1242940 non-null int64
day          1242940 non-null int64
hour         1242940 non-null int64
region_id    1242940 non-null object
bus_count    1242940 non-null float64
num_reads    1242940 non-null float64
speed        1242940 non-null float64
weekday      1242940 non-null float64
dtypes: float64(4), int64(4), object(1)
memory usage: 85.3+ MB


In [60]:
all_traffic.head()

Unnamed: 0,year,month,day,hour,region_id,bus_count,num_reads,speed,weekday
0,2016,1,1,0,1,6.8,120.2,27.062,4.0
1,2016,1,1,0,10,34.0,390.8,24.834,4.0
2,2016,1,1,0,11,16.2,265.8,26.004,4.0
3,2016,1,1,0,12,25.0,350.8,16.526,4.0
4,2016,1,1,0,13,33.6,511.4,18.136,4.0


In [61]:
# couple minutes
make_table(all_traffic, 'hourly_congestion', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


### Speed fix for congestion
Congestion is measured by average bus speed.

The problem:
- Overnight (between 11pm and 5am) we have few bus routes.
- Some regions have no buses overnight
- Some regions have only a few buses 
- Some buses are ending routes and have only a few reads
- Some buses are stationary (next morning staging)

The fix:
- replace the speed when there are few buses/reads and the speed is low (or when the speed is implausibly high)
- we assume low buses/reads to be overnight, when congestion is minimal
- the replacement speed is a low-congestion quantile speed for the region (the 0.90 quantile)

In [62]:
## Let's get the 0.90 quantile for every region, and then use that to fill in missing data

regions_90 = all_traffic.groupby(['region_id'])['speed'].quantile(0.9).reset_index()


In [63]:
regions_90.head()


Unnamed: 0,region_id,speed
0,1,25.068333
1,10,26.4534
2,11,26.808033
3,12,23.136667
4,13,24.143667


In [64]:
#### 5 MINUTES OR SO

# My read on this is that few buses run 24/7, so the overnight data is unreliable.
# Buses stage for the next morning.  You can see them all along Clark, LSD etc.
# They have speed=0 and may be recording.  Could talk to owner of dataset.

# Cutoff: (5 or fewer buses, or fewer than 100 reads) with speed < 25, or any speed > 40.
# In that case I put in the 0.90 quantile speed for the region.


def speed_check(bus, speed, reads, region_id, regions_90):
    # `and` binds tighter than `or`; the parentheses make that grouping explicit
    if ((bus <= 5 or reads < 100) and speed < 25) or speed > 40:
        return regions_90[regions_90['region_id']==region_id]['speed'].values[0]
    else:
        return speed
    

# apply is SLOOOOOOWWW, but not sure how else to accomplish this without iterating
all_traffic['speed'] = all_traffic.apply(lambda x: speed_check(x.bus_count, x.speed, x.num_reads, x.region_id, regions_90), axis=1)
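
One way to avoid the slow row-wise `apply` is to express the same rule with boolean masks and `np.where`. This is a sketch on made-up data, not the code that produced the table: `traffic_demo` and `regions_90_demo` are hypothetical stand-ins with the same column names as `all_traffic` and `regions_90`.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names follow all_traffic.
traffic_demo = pd.DataFrame({
    'region_id': ['1', '1', '1', '2'],
    'bus_count': [3.0, 30.0, 30.0, 2.0],
    'num_reads': [40.0, 400.0, 400.0, 500.0],
    'speed':     [5.0, 22.0, 45.0, 28.0],
})
regions_90_demo = pd.DataFrame({'region_id': ['1', '2'], 'speed': [25.0, 26.0]})

# Look up each row's regional replacement speed once.
fallback = traffic_demo['region_id'].map(regions_90_demo.set_index('region_id')['speed'])

# Same condition as speed_check, with the operator precedence written out.
sparse = (traffic_demo['bus_count'] <= 5) | (traffic_demo['num_reads'] < 100)
replace = (sparse & (traffic_demo['speed'] < 25)) | (traffic_demo['speed'] > 40)

traffic_demo['speed'] = np.where(replace, fallback, traffic_demo['speed'])
print(traffic_demo)
```

Here the first row (few buses, low speed) and the third row (speed above 40) take the regional value, while the other two keep their measured speed.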
      


### Create hourly_congestion TABLE

In [65]:
make_table(all_traffic, 'hourly_congestion', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


#### Add quantile speed
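In the models, congestion was not as useful as the EDA suggested.  I would like to compute a percentile speed within each region, so that each speed is ranked against that region's own distribution.  That way midday downtown traffic does not automatically look worse than peak traffic in other regions.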





In [66]:
import timeit

# This is slow!  Original speed took about 0.01s per iteration!  (200 minutes!!!!)
# used dictionary of smaller dfs and looked up inside lambda function (0.003 per iteration) (60 minutes!!!)

# split into 29 dfs, cycle perform lambda on each (no need for df.apply x, could do series.apply) (2 minutes...)

def quant_speed(speed, df):
    # returns a percentile speed for the region
    quant = percentileofscore(df['speed'], speed)
    return quant
    
    
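frames = []  # collect one DataFrame per region, then concatenate once

for i in all_traffic.region_id.unique():
    # .copy() so adding a column does not write to a slice of all_traffic
    df = all_traffic[all_traffic['region_id'] == str(i)].copy()
    df['quantile_speed'] = df['speed'].apply(lambda x: quant_speed(x, df))
    frames.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
new_df = pd.concat(frames)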
    
new_df.head()
# filtering first and pull from dict increased speed by 15x by using dictionary lookup
# starttime = timeit.default_timer()
# all_traffic['quantile_speed'] = all_traffic.apply(lambda x: quant_speed(x.speed, x.region_id, df_dict[x.region_id]), axis=1)
# print("The time difference is :", timeit.default_timer() - starttime)

Unnamed: 0,year,month,day,hour,region_id,bus_count,num_reads,speed,weekday,quantile_speed
0,2016,1,1,0,1,6.8,120.2,27.062,4.0,97.881475
29,2016,1,1,1,1,5.0,69.5,25.068333,4.0,82.09286
58,2016,1,1,2,1,3.0,34.0,25.068333,4.0,82.09286
87,2016,1,1,3,1,3.5,58.166667,25.068333,4.0,82.09286
116,2016,1,1,4,1,3.666667,43.0,25.068333,4.0,82.09286
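
A possibly faster route to the same column is a grouped percentile rank. As far as I can tell, `rank(pct=True)` with its default average tie handling, multiplied by 100, should agree with `percentileofscore` using its default `kind='rank'` for values that are present in the data, but treat that equivalence as an assumption to verify on a sample. `speeds_demo` is a made-up frame for illustration.

```python
import pandas as pd

# Illustrative speeds for two regions.
speeds_demo = pd.DataFrame({
    'region_id': ['1', '1', '1', '1', '2', '2'],
    'speed':     [10.0, 20.0, 30.0, 40.0, 15.0, 25.0],
})

# Percentile rank of each speed within its own region, on a 0-100 scale.
speeds_demo['quantile_speed'] = (
    speeds_demo.groupby('region_id')['speed'].rank(pct=True) * 100
)
print(speeds_demo)
```

This runs as a single grouped operation, so there is no per-region loop to manage.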


In [67]:
new_df.region_id.unique()

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
       '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29',
       '3', '4', '5', '6', '7', '8', '9'], dtype=object)

In [68]:
make_table(new_df, 'hourly_congestion', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


## 7) Create hourly_weather 
Created from wx_df dataframe.  

This data is hourly historical data from a bulk query from openweathermap.org for the city of Chicago.
Columns of interest:
- temp (deg K)
- rain_1h (mm rain in last hour)
- snow_1h (mm snow in last hour)
- weather_main (description)


### Preprocess hourly_weather

In [69]:
# Import weather data from csv
wx_df = pd.read_csv('data/chi_wx.csv')

In [70]:
wx_df.weather_main.unique()

array(['Clear', 'Clouds', 'Snow', 'Mist', 'Rain', 'Drizzle', 'Haze',
       'Fog', 'Thunderstorm', 'Smoke', 'Tornado', 'Dust', 'Squall'],
      dtype=object)

In [71]:
wx_df['time'] = pd.to_datetime(wx_df['dt_iso'].apply(lambda x: x[:-4]))
wx_df.head()
wx_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57306 entries, 0 to 57305
Data columns (total 26 columns):
dt                     57306 non-null int64
dt_iso                 57306 non-null object
timezone               57306 non-null int64
city_name              57306 non-null object
lat                    57306 non-null float64
lon                    57306 non-null float64
temp                   57306 non-null float64
feels_like             57306 non-null float64
temp_min               57306 non-null float64
temp_max               57306 non-null float64
pressure               57306 non-null int64
sea_level              0 non-null float64
grnd_level             0 non-null float64
humidity               57306 non-null int64
wind_speed             57306 non-null float64
wind_deg               57306 non-null int64
rain_1h                6750 non-null float64
rain_3h                820 non-null float64
snow_1h                1991 non-null float64
snow_3h                113 non-null float

In [72]:
wx_df['rain_3h'] = wx_df['rain_3h'].fillna(0)
wx_df['rain_1h'] = wx_df['rain_1h'].fillna(0)
wx_df['snow_3h'] = wx_df['snow_3h'].fillna(0)
wx_df['snow_1h'] = wx_df['snow_1h'].fillna(0)
wx_df['temp'] = wx_df['temp_max']
wx_df['year'] = wx_df.time.dt.year
wx_df['month'] = wx_df.time.dt.month
wx_df['day'] = wx_df.time.dt.day
wx_df['hour'] = wx_df.time.dt.hour
wx_df['weekday'] = wx_df.time.dt.weekday

In [73]:
wx_df.describe()

Unnamed: 0,dt,timezone,lat,lon,temp,feels_like,temp_min,temp_max,pressure,sea_level,...,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,year,month,day,hour,weekday
count,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,0.0,...,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0,57306.0
mean,1517459000.0,-19338.016962,41.87811,-87.6298,285.037523,279.69464,281.447694,285.037523,1016.055526,,...,0.046291,0.015908,0.001911,61.439361,750.328482,2017.599745,6.354082,15.741196,11.457648,3.002181
std,55886730.0,1739.719658,1.421098e-14,2.842196e-14,11.189403,13.319188,10.489184,11.189403,7.644737,,...,0.695364,0.127158,0.066529,32.006648,111.551016,1.768262,3.527509,8.822858,6.909577,2.001015
min,1420070000.0,-21600.0,41.87811,-87.6298,245.37,233.18,242.15,245.37,965.0,,...,0.0,0.0,0.0,0.0,200.0,2015.0,1.0,1.0,0.0,0.0
25%,1469056000.0,-21600.0,41.87811,-87.6298,276.103,269.34,273.71,276.103,1011.0,,...,0.0,0.0,0.0,40.0,800.0,2016.0,3.0,8.0,5.0,1.0
50%,1517942000.0,-18000.0,41.87811,-87.6298,284.26,278.355,280.645,284.26,1016.0,,...,0.0,0.0,0.0,75.0,802.0,2018.0,6.0,16.0,11.0,3.0
75%,1565992000.0,-18000.0,41.87811,-87.6298,294.85,291.51,290.7145,294.85,1021.0,,...,0.0,0.0,0.0,90.0,803.0,2019.0,9.0,23.0,17.0,5.0
max,1613430000.0,-18000.0,41.87811,-87.6298,311.48,309.89,306.132,311.48,1044.0,,...,35.0,8.4,6.0,100.0,804.0,2021.0,12.0,31.0,23.0,6.0


In [74]:
try:
    wx_df = wx_df.drop(columns=['dt', 
                        'dt_iso', 
                        'timezone', 
                        'city_name', 
                        'lat', 
                        'lon', 
                        'feels_like', 
                        'temp_min', 
                        'temp_max',
                        'pressure',
                        'sea_level',
                        'grnd_level',
                        'humidity',
                        'wind_speed',
                        'wind_deg',
                        'clouds_all',
                        'weather_description',
                        'weather_icon',
                        'weather_id',
                       ])
except KeyError as err:
    # drop raises KeyError when a listed column is not present
    print('Failed:', err)

In [75]:
print(len(wx_df))
print(wx_df.duplicated().sum())


print('Total hours in 6 years:', 365.25 * 24 * 6)
print('Unique entries:', len(wx_df.drop_duplicates()))  
# The row count does not match the 6-year estimate, so check the date range next.

print()
print(wx_df.time.min(), wx_df.time.max())  # the data runs into February 2021, past the 6-year mark
print('Total hours in 6 years (-1 mos):', 365.25 * 24 * 6 - 31 * 24)


wx_df.drop_duplicates(inplace=True)

57306
0
Total hours in 6 years: 52596.0
Unique entries: 57306

2015-01-01 00:00:00+00:00 2021-02-15 23:00:00+00:00
Total hours in 6 years (-1 mos): 51852.0
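
Rather than estimating the hour count by hand, the coverage can be checked directly against a generated hourly range. A minimal sketch, assuming a `time` column like the one in `wx_df`; `wx_demo` and its timestamps are invented for illustration.

```python
import pandas as pd

# Illustrative timestamps: 01:00 appears twice and 02:00 is missing.
wx_demo = pd.DataFrame({'time': pd.to_datetime([
    '2015-01-01 00:00', '2015-01-01 01:00', '2015-01-01 01:00',
    '2015-01-01 03:00',
])})

# Every hour between the first and last observation.
expected = pd.date_range(wx_demo['time'].min(), wx_demo['time'].max(),
                         freq=pd.Timedelta(hours=1))

# Hours with no row at all, and hours that have more than one row.
missing = expected.difference(pd.DatetimeIndex(wx_demo['time']))
repeated = wx_demo.loc[wx_demo['time'].duplicated(keep=False), 'time'].unique()

print('expected hours:', len(expected))
print('missing hours:', list(missing))
print('hours with several rows:', list(repeated))
```

Note that `duplicated()` on the full frame only flags rows that match in every column, so two rows for the same hour with different weather descriptions would not be counted there.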


### Create hourly_weather TABLE

In [76]:
make_table(wx_df, 'hourly_weather', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',)]


## 8) Create region_data TABLE 
Created from region_df dataframe.

Wanted to add a table which contained the region information.  Congestion TABLE uses region_id to break city into 29 traffic regions.  We use the regions for congestion (bus speed) data.

### Preprocess region_data TABLE

In [77]:
# This time we only grab what we need

region_df = client.get("kf7e-cur8", # regional congestion current data
                         select='region_id, region, description, north, south, east, west',
                         limit=1000
                    )

# Convert to pandas DataFrame
region_df = pd.DataFrame.from_records(region_df)  # should only return most recent for each region

In [78]:
region_df = region_df.groupby('region_id').max().reset_index()

In [79]:
# need these as floats so we can compare them
region_df[['north', 'south', 'east', 'west']] = region_df[['north', 'south', 'east', 'west']].astype(float)
region_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 7 columns):
region_id      29 non-null object
region         29 non-null object
description    29 non-null object
north          29 non-null float64
south          29 non-null float64
east           29 non-null float64
west           29 non-null float64
dtypes: float64(4), object(3)
memory usage: 1.7+ KB


### Add region to my crash df

In [80]:
crash_df[['latitude', 'longitude']] = crash_df[['latitude', 'longitude']].astype(float)
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61100 entries, 5 to 472016
Data columns (total 48 columns):
crash_record_id                  61100 non-null object
rd_no                            60631 non-null object
crash_date                       61100 non-null datetime64[ns]
posted_speed_limit               61100 non-null object
traffic_control_device           61100 non-null object
device_condition                 61100 non-null object
weather_condition                61100 non-null object
lighting_condition               61100 non-null object
first_crash_type                 61100 non-null object
trafficway_type                  61100 non-null object
alignment                        61100 non-null object
roadway_surface_cond             61100 non-null object
road_defect                      61100 non-null object
report_type                      59239 non-null object
crash_type                       61100 non-null object
damage                           61100 non-null object
pr

In [81]:
# add in the region for my crashes
# Resource hog
crash_df.columns


def which_region(lat, long, region_df):
    #print(lat, long)
    row = region_df[(region_df['east'] >= long) &
                    (region_df['west'] < long) &
                    (region_df['north'] >= lat) &
                    (region_df['south'] < lat)]['region_id'].max()
    return row

#df.iloc[:5]
# takes some 5min
crash_df['region_id'] = crash_df.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)
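
The region lookup can also be done for all points at once with broadcasting. This sketch is an approximation of `which_region`, with two differences I should flag: where regions overlap it takes the first matching row rather than the max region_id, and it returns None rather than NaN when no region matches. `assign_region`, `regions_demo`, and the coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_region(lat, long, regions):
    """Return the region_id whose bounding box contains each point, or None."""
    lat = np.asarray(lat, dtype=float)[:, None]      # shape (n_points, 1)
    long = np.asarray(long, dtype=float)[:, None]
    # Same inequalities as which_region; result has shape (n_points, n_regions).
    inside = ((regions['west'].to_numpy() < long) &
              (long <= regions['east'].to_numpy()) &
              (regions['south'].to_numpy() < lat) &
              (lat <= regions['north'].to_numpy()))
    ids = regions['region_id'].to_numpy()[inside.argmax(axis=1)].astype(object)
    ids[~inside.any(axis=1)] = None                  # point is outside every region
    return ids

# Two made-up, side-by-side regions.
regions_demo = pd.DataFrame({
    'region_id': ['1', '2'],
    'west': [-87.7, -87.6], 'east': [-87.6, -87.5],
    'south': [41.8, 41.8], 'north': [41.9, 41.9],
})
region_ids = assign_region([41.85, 41.85, 42.50], [-87.65, -87.55, -87.65], regions_demo)
print(region_ids)
```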

In [82]:
len(crash_df)
crash_df.columns

crash_df['time'] = pd.to_datetime(crash_df.crash_date)
crash_df['year'] = crash_df.time.dt.year
crash_df['month'] = crash_df.time.dt.month
crash_df['day'] = crash_df.time.dt.day
crash_df['hour'] = crash_df.time.dt.hour
crash_df['weekday'] = crash_df.time.dt.weekday

### Create region_data TABLE and update signal_crashes TABLE

With new info added to crash table, we overwrite previous table.

In [83]:
make_table(region_df, 'region_data', c, conn)
print()
make_table(crash_df, 'signal_crashes', c, conn)  # also update my crash data

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',)]

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersection_cams',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',)]


### Add region_id to intersection_cams
While I'm here and have the function ready, I add the region_id number to each red light camera intersection (int_cams).
The region lets me link the daily_violations and hourly_congestion TABLEs through intersection_cams.

*** NOTE: The region goes into the intersection_cams table rather than daily_violations, which keeps this step fast (one lookup per intersection instead of one per daily row).

In [84]:
# takes about 1 minute
int_cams['region_id'] = int_cams.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)


In [85]:
## commit my change
make_table(int_cams, 'intersection_cams', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',), ('intersection_cams',)]


<b> Use this code to test any of your tables for proper data storage </b>

In [86]:
query = c.execute("SELECT camera_id, violations FROM daily_violations;").fetchall()
print(query[:5])
print(len(query))

[('2141', 3), ('1612', 7), ('1413', 75), ('1814', 8), ('1014', 6)]
388761
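
To spot-check every table at once rather than one query at a time, the table names can be read from `sqlite_master` and counted in a loop. A small sketch against an in-memory database; `demo_conn`, `demo_cur`, and `table_row_counts` are hypothetical names, and the two inserted rows are only there to make the example self-contained.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')   # stand-in for the rlc.db connection
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE daily_violations (camera_id TEXT, violations INTEGER)")
demo_cur.executemany("INSERT INTO daily_violations VALUES (?, ?)",
                     [('2141', 3), ('1612', 7)])
demo_conn.commit()

def table_row_counts(cur):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    # Names come from sqlite_master, not user input, so quoting them is enough here.
    return {name: cur.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

print(table_row_counts(demo_cur))
```

Pointing the same function at the notebook's cursor `c` should list the row count of each table built so far.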


### Add intersections to signal_crashes 
Want to add this as a foreign key to help with queries.

In [87]:
sql_fetch_tables(c, conn)

[('intersection_chars',),
 ('cam_locations',),
 ('cam_startend',),
 ('daily_violations',),
 ('all_crashes',),
 ('hourly_congestion',),
 ('hourly_weather',),
 ('region_data',),
 ('signal_crashes',),
 ('intersection_cams',)]

In [88]:
# read data back in to prevent having to rerun code.
df = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
camloc_df = pd.read_sql_query('SELECT * FROM cam_locations', conn)
ints_df = pd.read_sql_query('SELECT * FROM intersection_cams', conn)


In [89]:
#ints_df.astype({'longitude':float})
pd.options.display.max_rows = 200


### Add intersection data to crashes

In [90]:
# Let's simplify it and make it a box instead.
box_side = 100  # side length in meters, so a crash must be within 50m of the intersection on each axis
box_lat = box_side / 111070 / 2 # half-side in degrees; 111070 is meters per deg lat in Chicago
box_long = box_side / 83000 / 2 # half-side in degrees; 83000 is meters per deg long in Chicago

def box_check(lat, long, int_df):
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    # print('n', n, 's', s, 'e', e, 'w', w, 'lat:', lat, 'long:', long)
    answer = int_df[  (int_df['lat'] > s) &
                      (int_df['lat'] < n) &
                      (int_df['long'] > w) &
                      (int_df['long'] < e)
                      
                     ]
    if answer.empty: return None
    return answer['intersection'].values[0]

    
# THIS SEEMS TO WORK AT SPEED AND ELIMINATES MEMORY PROBLEM
# this code is just to test out a chunk of data
for i in range(5000, 5100): 
    lat = float(df.iloc[i]['latitude'])
    long = float(df.iloc[i]['longitude'])
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    answer = int_df[  (int_df['lat'] > s) &
                      (int_df['lat'] < n) &
                      (int_df['long'] > w) &
                      (int_df['long'] < e)]['intersection'].values
    if len(answer): print(answer[0])
    
# 99th Halsted: 41.714230	-87.643043
# MOMENT OF TRUTH
df['intersection'] = df.apply(lambda x: box_check(float(x.latitude), float(x.longitude), int_df), axis=1)



HALSTED AND 95TH
CALIFORNIA AND DEVON
PULASKI AND 63RD
AUSTIN AND ADDISON
HOMAN/KIMBALL AND NORTH
CICERO AND DIVERSEY
BELMONT AND KEDZIE
MONTROSE AND WESTERN
87TH AND VINCENNES
HALSTED AND MADISON
STONY ISLAND/CORNELL AND 67TH
CICERO AND FULLERTON
CENTRAL AND IRVING PARK
WESTERN AND NORTH


### Overwrite signal_crashes TABLE
Now it is updated with intersections

In [91]:
make_table(df, 'signal_crashes', c, conn)


[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',)]


## 9) Create all_hours TABLE
Make a table that has one row for every date and hour for every intersection.  With it as the base table, my queries can use LEFT JOINs only.  The hourly weather data supplies the list of hours.

In [92]:
#int_chars = pd.read_sql_query("SELECT * FROM intersection_chars", conn)
v_df = pd.read_sql_query("SELECT * FROM daily_violations", conn)
wx_df = pd.read_sql_query("SELECT * FROM hourly_weather", conn)
rlc_df = pd.read_sql_query("SELECT * FROM daily_violations", conn)

In [93]:
grouped_dates = wx_df.groupby(['year', 'month', 'day', 'hour']).max().reset_index()[['year', 'month', 'day', 'hour']]
grouped_dates = grouped_dates[grouped_dates['year']>2015]

In [94]:
# didn't work 
grouped_dates.head()
grouped_dates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44952 entries, 8760 to 53711
Data columns (total 4 columns):
year     44952 non-null int64
month    44952 non-null int64
day      44952 non-null int64
hour     44952 non-null int64
dtypes: int64(4)
memory usage: 1.7 MB


In [95]:
big_df = grouped_dates.copy()

big_df['datetime'] = big_df.apply(lambda x: datetime(x.year, x.month, x.day, x.hour), axis=1)

In [96]:
my_ints = rlc_df.intersection.unique()

In [97]:
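# Should only have to do this once
# Build the hourly frame (including datetime) once, then stamp a copy with each
# intersection and concatenate in a single call, so every row keeps its datetime.
base_df = grouped_dates.copy()
base_df['datetime'] = pd.to_datetime(base_df[['year', 'month', 'day', 'hour']])
big_df = pd.concat([base_df.assign(intersection=name) for name in my_ints])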



In [98]:
big_df.intersection.isna().sum()

0

In [99]:
big_df.intersection.unique()
big_df.dropna(subset=['intersection'], inplace=True)


In [100]:
big_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7057464 entries, 8760 to 53711
Data columns (total 6 columns):
datetime        datetime64[ns]
day             int64
hour            int64
intersection    object
month           int64
year            int64
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 376.9+ MB


In [101]:
big_df.head()

Unnamed: 0,datetime,day,hour,intersection,month,year
8760,2016-01-01 00:00:00,1,0,4700 WESTERN,1,2016
8761,2016-01-01 01:00:00,1,1,4700 WESTERN,1,2016
8762,2016-01-01 02:00:00,1,2,4700 WESTERN,1,2016
8763,2016-01-01 03:00:00,1,3,4700 WESTERN,1,2016
8764,2016-01-01 04:00:00,1,4,4700 WESTERN,1,2016


### Create all_hours TABLE

In [102]:
make_table(big_df, 'all_hours', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('all_hours',)]
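With `all_hours` as the left-hand table, hours without any matching event still appear in query results with a count of zero. A minimal sketch of such a query, using an in-memory database and a hypothetical `crashes` table with made-up rows (the real tables have many more columns):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE all_hours (year INT, month INT, day INT, hour INT, intersection TEXT)")
cur.execute("CREATE TABLE crashes (year INT, month INT, day INT, hour INT, intersection TEXT)")

# Three consecutive hours at one intersection; two crashes, both in hour 1
cur.executemany(
    "INSERT INTO all_hours VALUES (?, ?, ?, ?, ?)",
    [(2016, 1, 1, h, "A AND B") for h in range(3)],
)
cur.executemany(
    "INSERT INTO crashes VALUES (?, ?, ?, ?, ?)",
    [(2016, 1, 1, 1, "A AND B"), (2016, 1, 1, 1, "A AND B")],
)

# COUNT(cr.intersection) skips NULLs, so unmatched hours get 0
rows = cur.execute("""
    SELECT ah.hour, COUNT(cr.intersection) AS n_crashes
    FROM all_hours AS ah
    LEFT JOIN crashes AS cr
      ON cr.intersection = ah.intersection
     AND cr.year = ah.year AND cr.month = ah.month
     AND cr.day = ah.day AND cr.hour = ah.hour
    GROUP BY ah.intersection, ah.year, ah.month, ah.day, ah.hour
    ORDER BY ah.hour
""").fetchall()
demo.close()
print(rows)
```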


## 10) Build int_startend TABLE 

Create a table with the first and last dates on which the cameras at each intersection were active.

We will just look at the earliest and latest entries from the violations.
When cameras are off, they just disappear from the database.

All cameras have gaps (maintenance and malfunctions). I do not account for these gaps in this project.
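The earliest/latest lookup described above can be sketched with a named aggregation. The toy frame below is made up; only the column names `intersection` and `violation_date` are taken from the notebook.

```python
import pandas as pd

# Hypothetical sample of daily violations
violations = pd.DataFrame({
    "intersection": ["A AND B", "A AND B", "C AND D"],
    "violation_date": pd.to_datetime(["2017-01-02", "2021-02-15", "2017-10-26"]),
})

# First and last day each intersection appears in the violations data
start_end_demo = (
    violations.groupby("intersection")["violation_date"]
    .agg(start="min", end="max")
    .reset_index()
)
print(start_end_demo)
```

This produces the same `intersection`, `start`, `end` layout without first copying the date into two helper columns.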

### Preprocess int_startend
Add intersection start-end dates

In [103]:
v_df = pd.read_sql_query("SELECT * FROM daily_violations", conn)

In [104]:
v_df['start'] = v_df['violation_date']
v_df['end'] = v_df['violation_date']

start_end = v_df.groupby('intersection').agg({'start':'min', 'end':'max'}).reset_index()

In [105]:
start_end.start.sort_values().tail(50)
start_end.head()

Unnamed: 0,intersection,start,end
0,111TH AND HALSTED,2017-01-02 00:00:00,2021-02-15 00:00:00
1,115TH AND HALSTED,2017-01-02 00:00:00,2017-10-26 00:00:00
2,119TH AND HALSTED,2017-01-02 00:00:00,2021-02-15 00:00:00
3,31ST ST AND MARTIN LUTHER KING DRIVE,2017-01-02 00:00:00,2021-02-15 00:00:00
4,35TH AND WESTERN,2017-01-02 00:00:00,2021-02-15 00:00:00


### Create int_startend TABLE

In [106]:
make_table(start_end, 'int_startend', c, conn)

[('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('all_hours',), ('int_startend',)]


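## Add speed to intersection_chars TABLE
Use the posted speed limit recorded in crashes at each intersection to add a speed column to my intersection characteristics. The code below takes the maximum posted limit per intersection; the mean or mode are alternatives.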
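Crash reports at the same intersection do not always agree on the posted limit, so the choice of aggregate matters. The sketch below compares the maximum with the mode on made-up data and fills intersections without crashes with a default of 30, the fallback used in this notebook. All frame and column names other than `intersection` and `posted_speed_limit` are hypothetical.

```python
import pandas as pd

# Hypothetical crash records and intersection list
crashes = pd.DataFrame({
    "intersection": ["A AND B", "A AND B", "A AND B", "C AND D"],
    "posted_speed_limit": [30, 35, 35, 30],
})
chars = pd.DataFrame({"intersection": ["A AND B", "C AND D", "E AND F"]})

by_int = crashes.groupby("intersection")["posted_speed_limit"]
speed_max = by_int.max()
# Series.mode() returns sorted values, so ties resolve to the smallest limit
speed_mode = by_int.agg(lambda s: s.mode().iloc[0])

DEFAULT_SPEED = 30  # used when an intersection has no crash records

# map() looks up each intersection; unmatched ones become NaN and get the default
chars["speed_max"] = chars["intersection"].map(speed_max).fillna(DEFAULT_SPEED).astype(int)
chars["speed_mode"] = chars["intersection"].map(speed_mode).fillna(DEFAULT_SPEED).astype(int)
print(chars)
```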

In [107]:
speed_df = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
int_char_df = pd.read_sql_query("SELECT * FROM intersection_chars", conn)


In [108]:
speed_df.posted_speed_limit = speed_df.posted_speed_limit.astype(int)

In [109]:
speed_df = speed_df.groupby('intersection').agg({'posted_speed_limit': 'max'}).reset_index()  # highest posted limit seen per intersection

In [110]:
speed_df.posted_speed_limit.unique()
speed_df.head()

Unnamed: 0,intersection,posted_speed_limit
0,111TH AND HALSTED,35
1,115TH AND HALSTED,35
2,119TH AND HALSTED,35
3,31ST AND CALIFORNIA,35
4,31ST ST AND MARTIN LUTHER KING DRIVE,35


In [111]:

# speed_df is defined globally
def speed_lookup(intersection):
    # one intersection had no crashes over the time period, so fall back to 30
    try:
        speed = speed_df[speed_df['intersection']==intersection]['posted_speed_limit'].values[0]
    except IndexError:  # no matching row in speed_df
        speed = 30
    return speed
    
int_char_df['speed'] = int_char_df['intersection'].apply(speed_lookup)

In [112]:
int_char_df.head()

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic,speed
0,2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED,43100,35
1,4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED,42500,35
2,4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.64193,1,119TH AND HALSTED,41800,35
3,2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA,41100,35
4,2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE,36500,35


### Rewrite intersection_chars TABLE

In [113]:
make_table(int_char_df, 'intersection_chars', c, conn)

[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('all_hours',), ('int_startend',), ('intersection_chars',)]


### Add Red Light On/Off to crash data

In [118]:
crash_df2 = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
se_df = pd.read_sql_query("SELECT * FROM int_startend", conn)

In [119]:
crash_df2 = crash_df2.merge(se_df, on='intersection', how='left')  # merge the frame just read from signal_crashes


In [120]:
crash_df2.columns

Index(['crash_record_id', 'rd_no', 'crash_date', 'posted_speed_limit',
       'traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'damage', 'prim_contributory_cause',
       'sec_contributory_cause', 'street_no', 'street_direction',
       'street_name', 'beat_of_occurrence', 'num_units', 'most_severe_injury',
       'injuries_total', 'injuries_fatal', 'injuries_incapacitating',
       'injuries_non_incapacitating', 'injuries_reported_not_evident',
       'injuries_no_indication', 'injuries_unknown', 'crash_hour',
       'crash_day_of_week', 'crash_month', 'latitude', 'longitude', 'lane_cnt',
       'intersection_related_i', 'hit_and_run_i', 'crash_date_est_i',
       'work_zone_i', 'work_zone_type', 'workers_present_i', 'intersection',
       'year', 'month', 'day', 'hour', 'region_id', 'time', 'weekday', 'sta

In [121]:
crash_df2['start'] = pd.to_datetime(crash_df2['start'])
crash_df2['end'] = pd.to_datetime(crash_df2['end'])
crash_df2['crash_date'] = pd.to_datetime(crash_df2['crash_date'])

In [122]:
# Determine whether each crash occurred while the camera was on (1) or off (0);
# None means the intersection has no start/end dates to compare against

def rlc_state(start, end, my_date):

    if (end - my_date).days >= 0 and (my_date - start).days >= 0:
        return 1
    elif (my_date - end).days > 0:
        return 0
    elif (start - my_date).days > 0:
        return 0
    else:
        return None

crash_df2['rlc_state'] = crash_df2.apply(lambda x: rlc_state(x.start, x.end, x.crash_date), axis=1)
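The same on/off check can be written without a row-wise `apply`. The sketch below is an alternative under stated assumptions, not the method used above: it compares calendar dates, so a crash at any time on the last active day counts as "on". With the day-difference test in `rlc_state`, a crash later in the day on the `end` date matches none of the branches and falls through to `None`. The sample rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical merged crash rows; the last one has no camera dates
demo = pd.DataFrame({
    "start": pd.to_datetime(["2017-01-02", "2017-01-02", "2017-01-02", None]),
    "end": pd.to_datetime(["2021-02-15", "2017-10-26", "2021-02-15", None]),
    "crash_date": pd.to_datetime([
        "2018-06-01 14:30", "2019-03-10 08:00", "2016-05-05 23:15", "2018-01-01 00:00",
    ]),
})

# Drop the time of day so the comparison is date against date
crash_day = demo["crash_date"].dt.normalize()
inside = (crash_day >= demo["start"]) & (crash_day <= demo["end"])

# Keep NaN where the start/end dates are unknown
known = demo["start"].notna() & demo["end"].notna()
demo["rlc_state"] = np.where(known, inside.astype(float), np.nan)
print(demo["rlc_state"].tolist())
```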

In [123]:
print('Total crashes rlc on: {}'.format(crash_df2['rlc_state'].sum()))
print('Total crashes rlc off: {}'.format(crash_df2[crash_df2['rlc_state']==0]['crash_date'].count()))


Total crashes rlc on: 6675.0
Total crashes rlc off: 1134


In [124]:
make_table(crash_df2, 'signal_crashes', c, conn)

[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('signal_crashes',)]


## Add camera state to all_hours
Needed to do this to make my queries easier, since the SQLite3 version used here has no FULL OUTER JOIN.
This step takes a significant amount of time (hours).

In [125]:

all_hours = pd.read_sql_query("SELECT * FROM all_hours", conn)


In [126]:
#all_hours['datetime'] = pd.to_datetime(all_hours)
all_hours.head()

Unnamed: 0,datetime,day,hour,intersection,month,year
0,2016-01-01 00:00:00,1,0,4700 WESTERN,1,2016
1,2016-01-01 01:00:00,1,1,4700 WESTERN,1,2016
2,2016-01-01 02:00:00,1,2,4700 WESTERN,1,2016
3,2016-01-01 03:00:00,1,3,4700 WESTERN,1,2016
4,2016-01-01 04:00:00,1,4,4700 WESTERN,1,2016


In [127]:
all_hours['datetime'] = pd.to_datetime(all_hours[['year', 'month', 'day', 'hour']])  # vectorized; avoids a row-wise apply over ~7M rows

In [128]:
all_hours.tail()

Unnamed: 0,datetime,day,hour,intersection,month,year
7057459,2021-02-15 19:00:00,15,19,PULASKI AND PETERSON,2,2021
7057460,2021-02-15 20:00:00,15,20,PULASKI AND PETERSON,2,2021
7057461,2021-02-15 21:00:00,15,21,PULASKI AND PETERSON,2,2021
7057462,2021-02-15 22:00:00,15,22,PULASKI AND PETERSON,2,2021
7057463,2021-02-15 23:00:00,15,23,PULASKI AND PETERSON,2,2021


In [129]:
se_df = pd.read_sql_query("SELECT * FROM int_startend", conn)
se_df['start'] = pd.to_datetime(se_df['start'])
se_df['end'] = pd.to_datetime(se_df['end'])

se_df.intersection

se_dict = {x:[se_df[se_df['intersection']==x]['start'].values[0],  
              se_df[se_df['intersection']==x]['end'].values[0]] for x in se_df.intersection}


In [130]:
# Let's make it faster
def get_camstate2(date, start, end):
    if date >= start and date <= end:
        return 1
    else:
        return 0

    
intersections = list(se_dict.keys())
intersection = intersections[0]
new_df = all_hours[all_hours['intersection']==intersection]
new_df['rlc_state'] = new_df['datetime'].apply(lambda x: get_camstate2(x, 
                                                  se_dict[intersection][0],
                                                  se_dict[intersection][1]
                                                    ))
print(new_df.head())
print(intersection, 'complete')

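# NOTE: the slice starts at index 2, so intersections[1] is never processed
# and is absent from new_df; use intersections[1:] to include it.
for intersection in intersections[2:]: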
    concat_me = all_hours[all_hours['intersection']==intersection]
    concat_me['rlc_state'] = concat_me['datetime'].apply(lambda x: get_camstate2(x, 
                                                  se_dict[intersection][0],
                                                  se_dict[intersection][1]
                                                    ))
    print(intersection, end=',')
    new_df = pd.concat([new_df, concat_me])

print("END")

                   datetime  day  hour       intersection  month  year  \
5663952 2016-01-01 00:00:00    1     0  111TH AND HALSTED      1  2016   
5663953 2016-01-01 01:00:00    1     1  111TH AND HALSTED      1  2016   
5663954 2016-01-01 02:00:00    1     2  111TH AND HALSTED      1  2016   
5663955 2016-01-01 03:00:00    1     3  111TH AND HALSTED      1  2016   
5663956 2016-01-01 04:00:00    1     4  111TH AND HALSTED      1  2016   

         rlc_state  
5663952          0  
5663953          0  
5663954          0  
5663955          0  
5663956          0  
111TH AND HALSTED complete
119TH AND HALSTED,31ST ST AND MARTIN LUTHER KING DRIVE,35TH AND WESTERN,4700 WESTERN,55TH AND KEDZIE,55TH AND WESTERN,55TH and PULASKI,63RD AND STATE,71ST AND ASHLAND,75TH AND STATE,79TH AND HALSTED,79TH AND KEDZIE,87TH AND VINCENNES,95TH AND STONEY ISLAND,99TH AND HALSTED,ADDISON AND HARLEM,ARCHER AND CICERO,ASHLAND AND 87TH,ASHLAND AND 95TH,ASHLAND AND DIVISION,ASHLAND AND FULLERTON,ASHLAND AND IR

The above code was extremely slow until I broke up the df into parts in a loop. 
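A further speed-up may be possible by dropping the per-row function entirely: merge the start/end dates onto the hourly frame and compare whole columns. This is a sketch on made-up data, assuming the same `intersection`, `start`, and `end` columns as `int_startend`; I have not timed it against the loop above.

```python
import pandas as pd

# Hypothetical hourly rows for two intersections
hours = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2016-12-31 23:00", "2017-01-02 00:00", "2017-06-01 12:00"] * 2
    ),
    "intersection": ["A AND B"] * 3 + ["C AND D"] * 3,
})
# Hypothetical camera activity windows
windows = pd.DataFrame({
    "intersection": ["A AND B", "C AND D"],
    "start": pd.to_datetime(["2017-01-02", "2017-03-01"]),
    "end": pd.to_datetime(["2021-02-15", "2017-05-01"]),
})

# A left merge keeps every hourly row and attaches its intersection's window
merged = hours.merge(windows, on="intersection", how="left")
merged["rlc_state"] = (
    (merged["datetime"] >= merged["start"]) & (merged["datetime"] <= merged["end"])
).astype(int)
result = merged.drop(columns=["start", "end"])
print(result)
```

Intersections missing from the window table end up with `rlc_state` 0, because comparisons against missing dates evaluate to False.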

In [131]:
new_df.head()

Unnamed: 0,datetime,day,hour,intersection,month,year,rlc_state
5663952,2016-01-01 00:00:00,1,0,111TH AND HALSTED,1,2016,0
5663953,2016-01-01 01:00:00,1,1,111TH AND HALSTED,1,2016,0
5663954,2016-01-01 02:00:00,1,2,111TH AND HALSTED,1,2016,0
5663955,2016-01-01 03:00:00,1,3,111TH AND HALSTED,1,2016,0
5663956,2016-01-01 04:00:00,1,4,111TH AND HALSTED,1,2016,0


In [132]:
new_df.isna().sum()

datetime        0
day             0
hour            0
intersection    0
month           0
year            0
rlc_state       0
dtype: int64

In [133]:
new_df.rlc_state.sum()

5351844

In [134]:
make_table(new_df, 'rlc_all_hours', c, conn)


[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('signal_crashes',), ('rlc_all_hours',)]


## Adding COVID TABLE

In [135]:
# Pull dataset naz8-j4nc from the Chicago Data Portal (up to 20,000 rows)
daily_covid = client.get("naz8-j4nc", limit=20000)

daily_covid = pd.DataFrame.from_records(daily_covid)  # Convert to pandas DataFrame

In [136]:
make_table(daily_covid, 'daily_covid', c, conn)

[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('signal_crashes',), ('rlc_all_hours',), ('daily_covid',)]


## Add Holiday TABLE

In [22]:
# Import holiday data from csv
holiday_df = pd.read_csv('data/holidays.csv')
holiday_df.head()

Unnamed: 0,index,date,holiday
0,1,2012-01-02,New Year Day
1,2,2012-01-16,Martin Luther King Jr. Day
2,3,2012-02-20,Presidents Day (Washingtons Birthday)
3,4,2012-05-28,Memorial Day
4,5,2012-07-04,Independence Day


In [25]:
holiday_df.date = pd.to_datetime(holiday_df.date)
holiday_df['year'] = holiday_df.date.dt.year
holiday_df['month'] = holiday_df.date.dt.month
holiday_df['day'] = holiday_df.date.dt.day
holiday_df.head()

Unnamed: 0,index,date,holiday,year,month,day
0,1,2012-01-02,New Year Day,2012,1,2
1,2,2012-01-16,Martin Luther King Jr. Day,2012,1,16
2,3,2012-02-20,Presidents Day (Washingtons Birthday),2012,2,20
3,4,2012-05-28,Memorial Day,2012,5,28
4,5,2012-07-04,Independence Day,2012,7,4


In [26]:
make_table(holiday_df, 'holidays', c, conn)

[('cam_locations',), ('cam_startend',), ('daily_violations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('all_hours',), ('int_startend',), ('intersection_chars',), ('signal_crashes',), ('rlc_all_hours',), ('daily_covid',), ('holidays',)]


## Close my connection to the db

In [9]:
c.close()
conn.close()  # closing the cursor alone leaves the connection open
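To make sure both the cursor and the connection are released even if a cell raises an error, the standard library's `contextlib.closing` can wrap them. A minimal sketch on an in-memory database; the table `t` and the variable names are made up.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, whether or not an exception occurred
with closing(sqlite3.connect(":memory:")) as demo_conn:
    with closing(demo_conn.cursor()) as demo_cur:
        demo_cur.execute("CREATE TABLE t (x INTEGER)")
        demo_cur.execute("INSERT INTO t VALUES (1)")
        demo_conn.commit()
        n_rows = demo_cur.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(n_rows)
```

Note that using a `sqlite3` connection directly as a context manager only commits or rolls back the transaction; it does not close the connection, which is why `closing` is used here.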