# Building a SQL database
This notebook builds the necessary db files for the project using SQLite3.

Most data is taken from the Chicago Data Portal (https://data.cityofchicago.org/) using the sodapy Socrata client.
The API endpoints for the data are:
- Red Light Violations: https://data.cityofchicago.org/resource/spqx-js37.json
- Congestion by Region 2018-Present: https://data.cityofchicago.org/resource/kf7e-cur8.json
- Congestion by Region 2013-2018: https://data.cityofchicago.org/resource/emtn-qqdi.json
- Traffic Crashes: https://data.cityofchicago.org/resource/85ca-t3if.json

Weather data is taken from https://openweathermap.org/weather-data and is saved as a CSV file in the data folder.

Tables to build:
- daily_violations (one entry for each camera each day with total violations)
- intersection_locations (one entry for each intersection with lat/long)
- intersection_cams (one entry for each intersection with camera_ids)
- signal_crashes (one entry for each intersection crash with many columns)
- cam_locations (one entry for each cam, with lat/long)
- cam_startend (one entry for each cam with start end dates for min/max dates active)
- hourly_congestion (one entry per hour with bus speed averages for each region)
- hourly_weather (one entry per hour with many weather cols)
- region_data (one entry per region with locations and descriptions to place intersections)


## Required Imports

In [1]:
#!pip install "dask[complete]"
# from dask.distributed import Client, progress
# client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
# client

In [2]:
import pandas as pd
from sodapy import Socrata
#import matplotlib.pyplot as plt
from datetime import datetime
from modules.myfuncs import *
import warnings
import numpy as np
from geopy.geocoders import Nominatim
# import dask
# import dask.dataframe as dd
import gc


warnings.filterwarnings('ignore')

## Create/connect to db and build the TABLEs

In [3]:
# Create a db
conn = create_connection('database/rlc2.db')  # function I created in myfuncs file
c = conn.cursor()
#conn.close()

sqlite3 version: 2.6.0
connected to database/rlc2.db
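
`create_connection` lives in `modules/myfuncs` and is not shown in this notebook. Below is a minimal sketch of what such a helper could look like, assuming it simply wraps `sqlite3.connect` and prints status lines like the ones above. The body is an illustration, not the project's actual code.

```python
import sqlite3


def create_connection(db_path):
    """Open (or create) a SQLite database file and return the connection.

    Hypothetical stand-in for the helper in modules/myfuncs.
    """
    conn = sqlite3.connect(db_path)
    # sqlite_version reports the version of the underlying SQLite library.
    print("SQLite library version:", sqlite3.sqlite_version)
    print("connected to", db_path)
    return conn


# ":memory:" keeps this example from writing a file to disk.
demo_conn = create_connection(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER)")
demo_tables = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
```

Passing a file path such as `'database/rlc2.db'` instead of `":memory:"` would create the file if it does not exist yet.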


## Set up the Socrata client


In [4]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:

url = "data.cityofchicago.org"
client = Socrata(url, None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofchicago.org",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")



# TABLE builds

For every TABLE
- Use a Socrata client query to get all relevant data
- Preprocess data as needed
- Create Table
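
The three steps above can be sketched end to end on a tiny, made-up record list. The records stand in for what `client.get(...)` returns (a list of dicts whose values are all strings); the table name `demo_violations` and the in-memory database are assumptions for illustration only.

```python
import sqlite3

import pandas as pd

# Stand-in for the list of dicts that client.get(...) returns; the values
# are made up and every field arrives as a string.
records = [
    {"camera_id": "1002", "violation_date": "2015-01-01T00:00:00.000", "violations": "4"},
    {"camera_id": "1003", "violation_date": "2015-01-02T00:00:00.000", "violations": "7"},
]

# 1) Query result -> DataFrame
df = pd.DataFrame.from_records(records)

# 2) Preprocess: convert the string columns to usable types
df["violations"] = df["violations"].astype(int)
df["violation_date"] = pd.to_datetime(df["violation_date"])

# 3) Create the table (replace it if it already exists)
demo_conn = sqlite3.connect(":memory:")
df.to_sql("demo_violations", demo_conn, if_exists="replace", index=False)

total = demo_conn.execute("SELECT SUM(violations) FROM demo_violations").fetchone()[0]
```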

Our data
- rlc_cam: up to 1M red light camera violation records from 2015 to 2020
- crash_data: up to 1M crash records from 2015 to 2020
- traffic_data: up to 10M traffic records from 2015 to 2020

Weather data is taken from csv in data folder

## 1) Build intersection_chars TABLE from int_df

In [5]:
from modules.int_chars import *
import pandas as pd

int_chars.keys()
int_df = pd.DataFrame.from_dict(int_chars, orient='index')
int_df['intersection'] = int_chars.keys()
int_df.isna().sum()

roads             0
protected_turn    0
total_lanes       0
medians           0
exit              0
split             0
way               0
underpass         0
no_left           0
angled            0
triangle          0
one_way           0
turn_lanes        0
lat               0
long              0
rlc               0
intersection      0
dtype: int64

In [6]:
int_df = int_df[int_df['rlc']==1]
int_df.columns

Index(['roads', 'protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'lat', 'long', 'rlc', 'intersection'],
      dtype='object')

In [7]:
cols_toint = ['protected_turn', 'total_lanes', 'medians', 'exit', 'split',
       'way', 'underpass', 'no_left', 'angled', 'triangle', 'one_way',
       'turn_lanes', 'rlc']
cols_tofloat = ['lat', 'long',]

int_df[cols_toint] = int_df[cols_toint].astype(int)
int_df[cols_tofloat] = int_df[cols_tofloat].astype(float)
int_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 183 entries, 111TH AND HALSTED to WESTERN AND TOUHY
Data columns (total 17 columns):
roads             183 non-null object
protected_turn    183 non-null int64
total_lanes       183 non-null int64
medians           183 non-null int64
exit              183 non-null int64
split             183 non-null int64
way               183 non-null int64
underpass         183 non-null int64
no_left           183 non-null int64
angled            183 non-null int64
triangle          183 non-null int64
one_way           183 non-null int64
turn_lanes        183 non-null int64
lat               183 non-null float64
long              183 non-null float64
rlc               183 non-null int64
intersection      183 non-null object
dtypes: float64(2), int64(13), object(2)
memory usage: 25.7+ KB


#### Now bring in my count information 

In [8]:
    
daily_traffic = client.get("pfsx-4n4m", 
                     limit=2000,
                    )

daily_traffic = pd.DataFrame.from_records(daily_traffic) # Convert to pandas DataFrame

In [9]:
daily_traffic.info()
cols_tokeep = ['traffic_volume_count_location_address', 'total_passing_vehicle_volume',]
daily_traffic = daily_traffic[cols_tokeep]
daily_traffic.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1279 entries, 0 to 1278
Data columns (total 15 columns):
id                                             1279 non-null object
traffic_volume_count_location_address          1279 non-null object
street                                         1279 non-null object
date_of_count                                  1279 non-null object
total_passing_vehicle_volume                   1279 non-null object
vehicle_volume_by_each_direction_of_traffic    1279 non-null object
latitude                                       1279 non-null object
longitude                                      1279 non-null object
location                                       1279 non-null object
:@computed_region_rpca_8um6                    1266 non-null object
:@computed_region_vrxf_vc4k                    1266 non-null object
:@computed_region_6mkv_f3dw                    1279 non-null object
:@computed_region_bdys_3d7i                    1265 non-null object
:@compute

Unnamed: 0,traffic_volume_count_location_address,total_passing_vehicle_volume
0,5838 West,7100
1,320 East,8600
2,1730 East,53500
3,125 East,700
4,2924 East,4200


In [10]:
daily_traffic.total_passing_vehicle_volume = daily_traffic.total_passing_vehicle_volume.astype(int)

Combine my characteristics with my daily_traffic

In [11]:
#int_chars.merge(daily_traffic, left_on='intersection', right_on='rkey')
def look_up_roads(road_list):
    total = 0
    #print(road_list)
    for road in road_list:
        count = daily_traffic[daily_traffic['traffic_volume_count_location_address']==road]['total_passing_vehicle_volume'].sum()
        total += count
        #print(count)
    return total

int_df['daily_traffic'] = int_df['roads'].apply(look_up_roads)
int_df.drop(columns=['roads'], inplace=True)
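
`look_up_roads` filters the whole `daily_traffic` frame once per road. A possible alternative, sketched here on made-up data, is to sum the counts per address once and then look each road up in a dictionary. The names `traffic`, `roads_per_intersection`, and `look_up_roads_fast` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for daily_traffic and the 'roads' lists in int_df.
traffic = pd.DataFrame({
    "traffic_volume_count_location_address": ["5838 West", "320 East", "320 East"],
    "total_passing_vehicle_volume": [7100, 8600, 400],
})
roads_per_intersection = pd.Series(
    {"A AND B": ["5838 West", "320 East"], "C AND D": ["999 North"]}
)

# Sum the counts per address once, instead of filtering the frame per road.
volume_by_address = (
    traffic.groupby("traffic_volume_count_location_address")[
        "total_passing_vehicle_volume"
    ]
    .sum()
    .to_dict()
)


def look_up_roads_fast(road_list):
    # Addresses with no count contribute 0, as in the original function.
    return sum(volume_by_address.get(road, 0) for road in road_list)


daily_totals = roads_per_intersection.apply(look_up_roads_fast)
```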

In [12]:

int_df.head(100)

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
111TH AND HALSTED,2,6,2,0,0,4,0,0,1,0,0,2,41.692362,-87.642423,1,111TH AND HALSTED,43100
115TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.685089,-87.642094,1,115TH AND HALSTED,42500
119TH AND HALSTED,4,6,2,0,0,4,0,0,0,0,0,4,41.677774,-87.641930,1,119TH AND HALSTED,41800
31ST AND CALIFORNIA,2,6,0,0,0,4,0,0,0,0,0,4,41.837424,-87.695022,1,31ST AND CALIFORNIA,43100
31ST ST AND MARTIN LUTHER KING DRIVE,2,10,2,0,1,4,0,2,0,0,0,0,41.838441,-87.617338,1,31ST ST AND MARTIN LUTHER KING DRIVE,46100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HAMLIN AND LAKE,0,4,0,0,0,4,2,0,0,0,0,0,41.885239,-87.720914,1,HAMLIN AND LAKE,28600
HAMLIN AND MADISON,0,6,1,0,0,4,0,0,0,0,0,4,41.880783,-87.720722,1,HAMLIN AND MADISON,34800
HARLEM AND BELMONT,4,4,0,0,0,4,0,0,0,0,0,4,41.937995,-87.806770,1,HARLEM AND BELMONT,53000
HARLEM AND NORTHWEST HWY,2,4,0,0,0,4,0,1,1,1,0,2,41.997015,-87.806839,1,HARLEM AND NORTHWEST HWY,47900


In [13]:
def make_table(df, table_name, c, conn):
    '''
    Write df to the SQLite table table_name, replacing the table if it exists.

    df          DataFrame to write
    table_name  string
    c           cursor object
    conn        sql connection object
    '''
    if table_name in sql_fetch_tables(c, conn):  # helper function in myfuncs
        delete_all_entries(c, conn, table_name) # in myfuncs
    
    df.to_sql(table_name, conn, if_exists='replace', index = False)    
    print(sql_fetch_tables(c, conn))
    


In [14]:
make_table(int_df, 'intersection_chars', c, conn)

[('int_chars',), ('intersection_counts',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',), ('intersection_chars',)]
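
`sql_fetch_tables` is another helper from `modules/myfuncs` that is not shown here. Judging only from the printed output (a list of 1-tuples), it could look something like the sketch below, which assumes it queries `sqlite_master`. This is an illustration, not the project's code.

```python
import sqlite3


def sql_fetch_tables(c, conn):
    """Return the table names in the database as a list of 1-tuples.

    Hypothetical stand-in for the helper in modules/myfuncs; conn is
    accepted only to mirror the call signature used in this notebook.
    """
    c.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return c.fetchall()


demo_conn = sqlite3.connect(":memory:")
demo_c = demo_conn.cursor()
demo_c.execute("CREATE TABLE intersection_chars (intersection TEXT)")
found = sql_fetch_tables(demo_c, demo_conn)

# A bare string never equals a 1-tuple, so membership needs the tuple form.
string_match = "intersection_chars" in found
tuple_match = ("intersection_chars",) in found
```

Because the result is a list of tuples, a test such as `table_name in sql_fetch_tables(c, conn)` is always false for a plain string. That does no harm in `make_table`, since `to_sql(..., if_exists='replace')` replaces the table either way.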


## 1) Build daily_violations TABLE  from rlc_df

### Get red light violation data from Socrata query

In [15]:
# Red light violations
# Takes several minutes to run and holds about 500 MB in memory while building

# First 1000000 results, returned as JSON from API / converted to Python list of dictionaries by sodapy
rlc_df = client.get("spqx-js37", #speed cams are at 'hhkd-xvj4' if you want to investigate?
                     #where='violation_date > 01-01-2020',
                     where='violation_date BETWEEN \'2015-01-01T00:00:00.000\' AND \'2020-12-31T00:00:00.000\'',
                     limit=1000000,
                    )

rlc_df = pd.DataFrame.from_records(rlc_df) # Convert to pandas DataFrame
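
Pulling up to a million rows in one request is slow and memory hungry. One option is to fetch in pages. The sketch below shows the paging loop only, with a fake data source so it runs offline; `fetch_in_pages` and `fetch_page(limit, offset)` are hypothetical names, not part of sodapy.

```python
def fetch_in_pages(fetch_page, page_size=50000, max_rows=1000000):
    """Collect rows page by page until a short page signals the end.

    fetch_page(limit, offset) is a hypothetical callable returning a list
    of records. No new page is requested once offset reaches max_rows.
    """
    rows = []
    offset = 0
    while offset < max_rows:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means nothing is left
        offset += page_size
    return rows


# Fake data source with 12 rows so the example runs without the network.
fake_rows = [{"violations": str(i)} for i in range(12)]


def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_in_pages(fake_fetch, page_size=5)
```

With the Socrata client, `fetch_page` could plausibly wrap `client.get("spqx-js37", limit=limit, offset=offset, ...)` together with the same `where` filter and a stable ordering. Treat that wiring as an assumption to verify against the sodapy documentation.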

### Preprocess Red Light Camera Data

Data columns of interest:

- INTERSECTION (Plain Text): Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection.
- CAMERA ID (Plain Text): A unique ID for each physical camera at an intersection, which may contain more than one camera.
- ADDRESS (Plain Text): The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera.
- VIOLATION DATE (Date & Time): The date when the violations occurred. NOTE: The citation may be issued on a different date.
- VIOLATIONS (Number): Number of violations for each camera on a particular day.
- LATITUDE (Number): The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84.
- LONGITUDE (Number): The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84.

#### Investigate rlc_df

In [16]:
rlc_df.info()
rlc_df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572559 entries, 0 to 572558
Data columns (total 10 columns):
intersection      572559 non-null object
camera_id         572325 non-null object
address           572559 non-null object
violation_date    572559 non-null object
violations        572559 non-null object
x_coordinate      542736 non-null object
y_coordinate      542736 non-null object
latitude          542736 non-null object
longitude         542736 non-null object
location          542736 non-null object
dtypes: object(10)
memory usage: 43.7+ MB


intersection          0
camera_id           234
address               0
violation_date        0
violations            0
x_coordinate      29823
y_coordinate      29823
latitude          29823
longitude         29823
location          29823
dtype: int64

#### Drop nan values and unnecessary columns
Every column arrives as a text (object) column, so we need to convert types before any further preprocessing.

A fair number of rows are missing location/lat/long (29,823 of 572,559).
That is a large enough portion of the dataset that we should look the locations up rather than drop the rows.

Rows with a missing camera_id will have to be dropped, since we cannot tell which camera they belong to.

We will not be using x_coordinate or y_coordinate, so we drop those. We will also drop location, since we already have lat/long in other columns.

In [17]:
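#client_df.dropna(subset=['camera_id']).isna().sum()
try:
    # wrapped in a try so that re-running the cell skips the drop instead of failing
    rlc_df.dropna(subset=['camera_id'], inplace=True)
    
    # drop xy coord and location columns
    # NOTE: index=1 also removes the row labeled 1, which is why that row is absent from the outputs below
    rlc_df = rlc_df.drop(columns=['x_coordinate', 'y_coordinate', 'location'], index=1)
except KeyError:
    # the columns (and row 1) are already gone
    pass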



rlc_df.isna().sum()

intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          29809
longitude         29809
dtype: int64
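
A possible way to make this cleaning step safe to re-run without a `try` block is `drop(..., errors="ignore")`, which skips columns that are already gone. The sketch below uses a made-up miniature of `rlc_df`, and `clean` is a hypothetical helper name. Unlike the cell above, it passes no `index=` argument, so no row is removed by label.

```python
import numpy as np
import pandas as pd

# Made-up miniature of rlc_df.
demo = pd.DataFrame({
    "camera_id": ["2763", None, "2552"],
    "violations": ["4", "2", "5"],
    "x_coordinate": [np.nan, 1.0, 2.0],
    "location": [None, "a", "b"],
})


def clean(df):
    # Rows without a camera_id cannot be attributed to a camera.
    df = df.dropna(subset=["camera_id"])
    # errors="ignore" skips columns that are already gone, so no try/except.
    return df.drop(
        columns=["x_coordinate", "y_coordinate", "location"], errors="ignore"
    )


once = clean(demo)
twice = clean(once)  # running the step again changes nothing
```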

#### Manipulate datatypes for preprocessing. 

In [20]:
rlc_df['violations'] = rlc_df['violations'].apply(int)
rlc_df['latitude'] = rlc_df['latitude'].apply(float)
rlc_df['longitude'] = rlc_df['longitude'].apply(float)
rlc_df['violation_date'] = pd.to_datetime(rlc_df['violation_date'])
rlc_df['month'] = rlc_df['violation_date'].apply(lambda x: int(x.month))
rlc_df['day'] = rlc_df['violation_date'].apply(lambda x: int(x.day))  # fixed from dat to day!

rlc_df['weekday'] = rlc_df['violation_date'].apply(lambda x: int(datetime.weekday(x)))
rlc_df['year'] = rlc_df['violation_date'].apply(lambda x: int(x.year))

rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year
0,IRVING PARK AND KILPATRICK,2763,4700 W IRVING PARK ROA,2015-04-09,4,,,4,9,3,2015
2,115TH AND HALSTED,2552,11500 S HALSTED STREE,2015-04-08,5,,,4,8,2,2015
3,IRVING PARK AND KILPATRICK,2764,4700 W IRVING PARK ROA,2015-04-19,4,,,4,19,6,2015
4,ELSTON AND IRVING PARK,1503,3700 W IRVING PARK ROA,2015-04-23,3,,,4,23,3,2015
5,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808378,-87.684571,6,5,2,2019
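
The date parts can also be extracted without per-row lambdas by using the pandas `.dt` accessor on a datetime column. A small sketch on two dates taken from the rows above:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-04-09", "2019-06-05"]))

# Each .dt attribute is computed for the whole column at once.
parts = pd.DataFrame({
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.weekday,  # Monday=0 ... Sunday=6
    "year": dates.dt.year,
})
```

On a frame the size of `rlc_df` this is typically much faster than `apply` with a lambda, and the resulting values match the `month`, `day`, `weekday`, and `year` columns shown above for these two dates.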


### Write the daily_violations TABLE - from rlc_df

In [21]:
def make_table(df, table_name, c, conn):
    '''
    Write df to the SQLite table table_name, replacing the table if it exists.

    df          DataFrame to write
    table_name  string
    c           cursor object
    conn        sql connection object
    '''
    if table_name in sql_fetch_tables(c, conn):  # helper function in myfuncs
        delete_all_entries(c, conn, table_name) # in myfuncs
    
    df.to_sql(table_name, conn, if_exists='replace', index = False)    
    print(sql_fetch_tables(c, conn))
    


In [22]:
make_table(rlc_df, 'daily_violations', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('cam_locations',), ('cam_startend',), ('intersection_chars',), ('daily_violations',)]


## 2) Build cam_locations TABLE - from cam_locs AND cam_startend TABLE from cam_startend

#### Make a df with info for each camera
Will contain the following:
- camera_id
- location
- start date (when was the camera turned on)
- end date (when was the camera turned off)

In [54]:
cam_df = rlc_df.copy()
cam_df['start'] = cam_df['camera_id'].apply(lambda x: None)
cam_df['end'] = cam_df['camera_id'].apply(lambda x: None)

In [55]:
cam_start = cam_df.groupby(['camera_id'])['violation_date'].min().reset_index()
cam_end = cam_df.groupby(['camera_id'])['violation_date'].max().reset_index()

cam_startend = cam_start.copy()

#print(cam_end[cam_end['camera_id']=='1503'].values[0][1])  # for testing output
cam_startend['end'] = cam_start['camera_id'].apply(lambda x: cam_end[cam_end['camera_id']==x].values[0][1])

cam_startend.rename(columns={"violation_date": "start"}, inplace=True)
                                                   
print('NA values in cam_startend:', cam_startend.isna().sum(), end='\n\n', sep='\n')

print('Describe cam_startend:', cam_startend.describe(), end='\n\n', sep='\n')

cam_startend.head()

NA values in cam_startend:
camera_id    0
start        0
end          0
dtype: int64

Describe cam_startend:
       camera_id                start                  end
count        363                  363                  363
unique       363                   18                   19
top         1413  2015-01-01 00:00:00  2020-12-31 00:00:00
freq           1                  285                  243
first        NaN  2015-01-01 00:00:00  2015-03-02 00:00:00
last         NaN  2018-03-05 00:00:00  2020-12-31 00:00:00



Unnamed: 0,camera_id,start,end
0,1002,2015-01-01,2020-12-31
1,1003,2015-01-01,2020-12-31
2,1011,2015-01-02,2020-12-31
3,1014,2015-01-01,2020-12-31
4,1023,2015-01-02,2020-12-31
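
The first and last active date per camera can also be computed in a single grouped aggregation rather than two groupbys plus a row-wise lookup. A sketch on made-up rows, using pandas named aggregation:

```python
import pandas as pd

# Made-up miniature of the violation records.
demo = pd.DataFrame({
    "camera_id": ["1002", "1002", "1011", "1011"],
    "violation_date": pd.to_datetime(
        ["2015-01-01", "2020-12-31", "2015-01-02", "2019-05-01"]
    ),
})

# One pass over the groups produces both columns with their final names.
startend = (
    demo.groupby("camera_id")["violation_date"]
    .agg(start="min", end="max")
    .reset_index()
)
```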


## Make a db table that has camera locations and intersections
Intersections are present (and addresses), but we do not have lat/long info for all cams

In [56]:
rlc_df.groupby('camera_id')['latitude'].max()  # Some cams do not have any data for lat long at all


# Some of the addresses are truncated and cannot be looked up with the geocoder

address_fix = {'2400 W VAN BUREN STREE': '2400 W VAN BUREN STREET',
               '4700 W IRVING PARK ROA': '4700 W IRVING PARK ROAD',
               '11500 S HALSTED STREE': '11500 S HALSTED STREET',
               '5500 S WENTWORTH AVEN': '5500 S WENTWORTH AVENUE',
                '10300 S HALSTED STREE': '10300 S HALSTED STREET',
               '3700 W IRVING PARK ROA': '3700 W IRVING PARK ROAD',
               '1600 W IRVING PARK ROA': '1600 W IRVING PARK ROAD',
               '7900 S JEFFERY BOULEV': '7900 S JEFFERY BOULEVARD',
               '2800 W IRVING PARK ROA': '2800 W IRVING PARK ROAD',
               '5200 W IRVING PARK ROA': '5200 W IRVING PARK ROAD',
               '3100 S DR MARTIN L KING': '3100 S MARTIN KING DRIVE',
               '1600 W DIVERSEY PARKWA': '1600 W DIVERSEY PARKWAY',
               '140 W KINZIE': '140 W Kinzie St',
               '800 N SACRAMENTO AVEN':'800 N SACRAMENTO AVENUE',
               '3200 N LAKESHORE DRIV':'3200 N LAKE SHORE DRIVE',
               '6400 W FULLERTON AVENU':'6400 W FULLERTON AVENUE',
               '6400 N MILWAUKEE AVEN':'6400 N MILWAUKEE AVENUE',
               '7900 S STONEY ISLAND':'7900 S Stony Island Ave',  
               '150 N SACRAMENTO BOUL':'150 N SACRAMENTO BOULEVARD',
                '3200 N LAKESHORE DRIVE':'3200 N Lake Shore Dr',
               '7900 S STONEY ISLAND AVENUE':'7900 S Stony Island Ave',
               '5600 W FULLERTON AVENU':'5600 W FULLERTON AVENUE',
               '8700 S LAFAYETTE AVEN':'8700 S LAFAYETTE AVENUE',
               '4400 N MILWAUKEE AVEN':'4400 N MILWAUKEE AVENUE',
              }

In [57]:
# One camera_id shows up under two intersections in the data; we find the stray record and drop it below.

cam_locs = rlc_df.groupby(['camera_id', 'intersection']).max().reset_index()
cam_locs.head()

# the two lengths below do not match, so one camera_id is duplicated
len(cam_locs)  # 364 total
len(cam_locs['camera_id'].unique()) # 363

cam_locs[cam_locs['camera_id'].duplicated()]  # 1421 is dupe
print('Two of them\n', cam_locs[cam_locs['camera_id'] == '1421'])  # we see two of them
print()

# Which one is it?
print('Damen/Diversey', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='DAMEN AND DIVERSEY')]['camera_id'].count())
print('Laramie/Fullerton:', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='LARAMIE AND FULLERTON')]['camera_id'].count())

# Turns out that a camera has two locations. One was only used one time.  We drop it.
cam_locs = cam_locs[(cam_locs['camera_id']!='1421') | (cam_locs['intersection']!='DAMEN AND DIVERSEY')]
print("Total cams", len(cam_locs))  # 363 total (got rid of the bad one)

Two of them
    camera_id           intersection                  address violation_date  \
83      1421     DAMEN AND DIVERSEY  2000 W DIVERSEY PARKWAY     2017-11-30   
84      1421  LARAMIE AND FULLERTON    2400 N LARAMIE AVENUE     2020-12-31   

    violations   latitude  longitude  month  day  weekday  year  
83           1  41.932394 -87.678173     11   30        3  2017  
84           6  41.924152 -87.756295     12   31        6  2020  

Damen/Diversey 1
Laramie/Fullerton: 1219
Total cams 363
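
The same cleanup can be expressed generically: count rows per (camera, intersection) pair and keep, for each camera, the pair with the most records. This is a sketch on made-up rows, not the cell above rewritten; it assumes the most frequent intersection is the correct one.

```python
import pandas as pd

# Made-up rows: camera 1421 shows up once under a second intersection.
demo = pd.DataFrame({
    "camera_id": ["1421", "1421", "1421", "1002"],
    "intersection": [
        "LARAMIE AND FULLERTON", "LARAMIE AND FULLERTON",
        "DAMEN AND DIVERSEY", "WESTERN AND CERMAK",
    ],
})

# Count rows per (camera, intersection) pair.
pair_counts = (
    demo.groupby(["camera_id", "intersection"]).size().reset_index(name="n")
)

# For each camera keep only the pair with the most rows.
keep = pair_counts.loc[pair_counts.groupby("camera_id")["n"].idxmax()]
```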


In [58]:
cam_locs.isna().sum()  # missing location for 19 cameras.  Let's fix it

camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          19
longitude         19
month              0
day                0
weekday            0
year               0
dtype: int64

We are also missing locations for 19 of the 363 cameras. Let's look them up!

In [59]:

'''
This section goes through all of the red light cameras (RLCs) and assigns each a lat/long.
Many cameras are missing it.
Each camera does have an address, though,
so we use geocoding to get the lat/long.
'''

# Get every red light camera with its GPS location.
# This will aid in placing the accidents at RLC intersections later (if closer than a threshold, point to point).
# Some RLCs are missing location data but have addresses, so geocoding can look them up.


geolocator = Nominatim(user_agent="https://github.com/sciencelee/chicago_rlc")  # please change to match repo

# Some example code
#location = geolocator.geocode("175 5th Avenue NYC")
#print(location.address)
# out: Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...

#print((location.latitude, location.longitude))
# out: (40.7410861, -73.9896297241625)

#print(location.raw)
# out: {'place_id': '9167009604', 'type': 'attraction', ...}


# CAN USE THIS TO FIGURE OUT MY LAT LONG FROM RLC ADDRESS (or crash later)   
 
def get_geocode(lat, long, address):
    if lat > 0:  # already a proper lat/long, assumed to be correct
        return (lat, long)
    else:  # lat is missing (NaN), so geocode the address instead
        fixed = address_fix.get(address, address)  # repair addresses with truncated characters
        location = geolocator.geocode(fixed + ', Chicago, IL')
        if location is None:
            print(address + ' -> ' + fixed + ' : could not geolocate')  # print it out if we can't find it (address errors)
            return (np.nan, np.nan)  # keep a tuple so the lat/long extraction below still works
        else:
            return (location.latitude, location.longitude)

        


# Got it down to one-liner for this. Found out you can't extract and assign series like you can variables
cam_locs['location'] = cam_locs.apply(lambda x: get_geocode(x.latitude, x.longitude, x.address), axis=1)
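
Geocoding calls go over the network, and public geocoding services are usually rate limited, so it can help to look up each distinct address only once. The sketch below wraps any geocoding function in a small cache; `make_cached_geocoder`, `geocode_fn`, and `fake_geocode` are hypothetical names, and the fake function stands in for a real service so the example runs offline.

```python
def make_cached_geocoder(geocode_fn, fixes):
    """Wrap geocode_fn so each distinct address is looked up only once.

    geocode_fn(address) is a hypothetical callable returning a
    (lat, long) tuple or None; fixes maps truncated addresses to full ones.
    """
    cache = {}

    def lookup(address):
        address = fixes.get(address, address)
        if address not in cache:
            cache[address] = geocode_fn(address)
        return cache[address]

    return lookup


calls = []


def fake_geocode(address):
    # Stand-in for a real geocoding service; records each call it receives.
    calls.append(address)
    return (41.95, -87.74)


lookup = make_cached_geocoder(
    fake_geocode, {"4700 W IRVING PARK ROA": "4700 W IRVING PARK ROAD"}
)
first = lookup("4700 W IRVING PARK ROA")
second = lookup("4700 W IRVING PARK ROAD")
```

Both calls resolve to the same repaired address, so the underlying function is invoked a single time.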

In [60]:
cam_locs['location'].head()
cam_locs['latitude'] = cam_locs['location'].apply(lambda x: x[0])
cam_locs['longitude'] = cam_locs['location'].apply(lambda x: x[1])

cam_locs = cam_locs.drop(columns=['violation_date', 'violations', 'month', 'weekday', 'year', 'location'])

In [61]:
cam_locs.info()
cam_locs.isna().sum() # No longer missing location for 19 cameras.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 363 entries, 0 to 363
Data columns (total 6 columns):
camera_id       363 non-null object
intersection    363 non-null object
address         363 non-null object
latitude        363 non-null float64
longitude       363 non-null float64
day             363 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 19.9+ KB


camera_id       0
intersection    0
address         0
latitude        0
longitude       0
day             0
dtype: int64

### Lat long fixes
During EDA, we found that five cameras had a completely wrong lat/long location.
Several others were located a little too far from the intersection to work properly. When we rebuild the db, we will use a distance threshold larger than 30 m.

Here are the fixes I found easily.

In [62]:
correct_camlocs = {'WENTWORTH AND GARFIELD': (41.79435532184194, -87.63114616279303),
                   'KIMBALL AND LINCOLN': (41.99454797825689, -87.71403619467101),
                    'IRVING PARK AND LARAMIE': (41.95330280770521, -87.75714705140294),
                   'IRVING PARK AND KILPATRICK': (41.953454972337305, -87.74460631799197),
                   '31ST ST AND MARTIN LUTHER KING DRIVE':(41.838438059816816, -87.61731906497867),
                   '31ST AND CALIFORNIA':(41.83743605501678, -87.6950324427879),
                   'ELSTON AND LAWRENCE': (41.96809435761252, -87.74010862196117),
                   'OGDEN AND KOSTNER':(41.84767736575193, -87.73437725628261),
                    'IRVING PARK AND CALIFORNIA':(41.95399037931945, -87.69821479681646),
                   'LAKE SHORE DR AND BELMONT': (41.940128786177176, -87.63954362976928),
                     }




In [63]:
cam_locs.head()

Unnamed: 0,camera_id,intersection,address,latitude,longitude,day
0,1002,WESTERN AND CERMAK,2200 S WESTERN AVENUE,41.851984,-87.685786,31
1,1003,WESTERN AND CERMAK,2400 W CERMAK ROAD,41.852141,-87.685753,31
2,1011,PETERSON AND WESTERN,6000 N WESTERN AVE,41.990586,-87.689822,31
3,1014,PETERSON AND WESTERN,2400 W PETERSON,41.990609,-87.689735,31
4,1023,IRVING PARK AND NARRAGANSETT,6400 W IRVING PK,41.953025,-87.786683,31


In [64]:
print(cam_locs.columns)

print(correct_camlocs['WENTWORTH AND GARFIELD'][0])
print(correct_camlocs.keys())
print('WENTWORTH AND GARFIELD' in correct_camlocs.keys())
#cam_locs['latitude'] = cam_locs.apply(lambda x: correct_camlocs[x['intersection']][0] if x['intersection'] in correct_camlocs.keys() else x['latitude'], axis=1)
#cam_locs['longitude'] = cam_locs.apply(lambda x: correct_camlocs[x['intersection']][1] if x['intersection'] in correct_camlocs.keys() else x['longitude'], axis=1)

cam_locs.head()

Index(['camera_id', 'intersection', 'address', 'latitude', 'longitude', 'day'], dtype='object')
41.79435532184194
dict_keys(['WENTWORTH AND GARFIELD', 'KIMBALL AND LINCOLN', 'IRVING PARK AND LARAMIE', 'IRVING PARK AND KILPATRICK', '31ST ST AND MARTIN LUTHER KING DRIVE', '31ST AND CALIFORNIA', 'ELSTON AND LAWRENCE', 'OGDEN AND KOSTNER', 'IRVING PARK AND CALIFORNIA', 'LAKE SHORE DR AND BELMONT'])
True


Unnamed: 0,camera_id,intersection,address,latitude,longitude,day
0,1002,WESTERN AND CERMAK,2200 S WESTERN AVENUE,41.851984,-87.685786,31
1,1003,WESTERN AND CERMAK,2400 W CERMAK ROAD,41.852141,-87.685753,31
2,1011,PETERSON AND WESTERN,6000 N WESTERN AVE,41.990586,-87.689822,31
3,1014,PETERSON AND WESTERN,2400 W PETERSON,41.990609,-87.689735,31
4,1023,IRVING PARK AND NARRAGANSETT,6400 W IRVING PK,41.953025,-87.786683,31
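
The lines that would apply `correct_camlocs` are commented out in the cell above. If the corrections were to be applied, one vectorized option is to map each intersection to its corrected value and fall back to the existing value where no correction exists. A sketch on a made-up two-row frame (the placeholder 0.0 coordinates are invented):

```python
import pandas as pd

corrections = {"WENTWORTH AND GARFIELD": (41.79435532184194, -87.63114616279303)}

demo = pd.DataFrame({
    "intersection": ["WENTWORTH AND GARFIELD", "WESTERN AND CERMAK"],
    "latitude": [0.0, 41.851984],
    "longitude": [0.0, -87.685786],
})

# map() gives the corrected value, or NaN where no correction exists.
lat_fix = demo["intersection"].map({k: v[0] for k, v in corrections.items()})
long_fix = demo["intersection"].map({k: v[1] for k, v in corrections.items()})

# fillna() keeps the original coordinate for uncorrected intersections.
demo["latitude"] = lat_fix.fillna(demo["latitude"])
demo["longitude"] = long_fix.fillna(demo["longitude"])
```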


### Create cam_locations TABLE - from cam_locs AND cam_startend from cam_startend

In [65]:
make_table(cam_locs, 'cam_locations', c, conn)
make_table(cam_startend, 'cam_startend', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('daily_violations',), ('cam_startend',), ('cam_locations',)]
[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('daily_violations',), ('cam_locations',), ('cam_startend',)]


### We still have missing lat/long info in rlc_df. Let's fix it
Now that we have cam_locs, we can fill in rlc_df before moving on.

In [66]:
rlc_df.isna().sum()  
cam_locs.head()

Unnamed: 0,camera_id,intersection,address,latitude,longitude,day
0,1002,WESTERN AND CERMAK,2200 S WESTERN AVENUE,41.851984,-87.685786,31
1,1003,WESTERN AND CERMAK,2400 W CERMAK ROAD,41.852141,-87.685753,31
2,1011,PETERSON AND WESTERN,6000 N WESTERN AVE,41.990586,-87.689822,31
3,1014,PETERSON AND WESTERN,2400 W PETERSON,41.990609,-87.689735,31
4,1023,IRVING PARK AND NARRAGANSETT,6400 W IRVING PK,41.953025,-87.786683,31


## Decided to eliminate cam position in favor of intersection lat/long
I hope this makes all of my position data consistent for gathering crash info.
A camera is sometimes positioned up to 35 m up the road from its intersection. Using the camera location would cause us to pick up crashes from other intersections or miss some in the intersection of interest.

Remedy: Use the center point of the intersection for all cams.

How to do it: I will change the cam_locs data to match the intersection instead of the individual camera.
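
To judge how far a point sits from an intersection center, a point-to-point distance in metres is needed. Below is a minimal haversine sketch using only the standard library. The function name `haversine_m` is hypothetical, and the 30 m cutoff is the figure mentioned earlier in this notebook, used here purely as an example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


# Two coordinates that appear in this notebook for IRVING PARK AND KILPATRICK:
# the int_df lat/long and the entry in correct_camlocs.
d = haversine_m(41.953395, -87.744635, 41.953454972337305, -87.74460631799197)
within_threshold = d < 30  # example cutoff in metres
```

The same function could be used to test whether a crash location falls within a chosen radius of an intersection center.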

In [67]:
int_df.columns


Index(['protected_turn', 'total_lanes', 'medians', 'exit', 'split', 'way',
       'underpass', 'no_left', 'angled', 'triangle', 'one_way', 'turn_lanes',
       'lat', 'long', 'rlc', 'intersection', 'daily_traffic'],
      dtype='object')

In [97]:
def location_correction(int_df, intersect, latlong):
    # lookup function from intersection df to get the lat long
    # int_df is the intersection characteristic frame from 1) above
    # intersect is the intersection name used to link tables/df
    # latlong is either 'lat' or 'long'
    if latlong == 'lat':
        lat = int_df[int_df['intersection']==intersect]['lat'].values[0]
        if pd.isna(lat): print(lat, intersect)  # flag intersections with no latitude
        return lat
    else:
        long = int_df[int_df['intersection']==intersect]['long'].values[0]
        return long

cam_locs['latitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'lat'))
cam_locs['longitude'] = cam_locs['intersection'].apply(lambda x: location_correction(int_df, x, 'long'))

In [98]:
rlc_df.intersection.head()

0    IRVING PARK AND KILPATRICK
2             115TH AND HALSTED
3    IRVING PARK AND KILPATRICK
4        ELSTON AND IRVING PARK
5                  4700 WESTERN
Name: intersection, dtype: object

In [99]:
int_df[int_df['intersection']=='IRVING PARK AND KILPATRICK']

Unnamed: 0,protected_turn,total_lanes,medians,exit,split,way,underpass,no_left,angled,triangle,one_way,turn_lanes,lat,long,rlc,intersection,daily_traffic
IRVING PARK AND KILPATRICK,1,6,0,0,0,4,0,0,0,0,1,3,41.953395,-87.744635,1,IRVING PARK AND KILPATRICK,37100


In [103]:
#⏳⏳⏳⏳⏳⏳
# THIS TAKES SOME TIME (8min on my macbook pro)
def read_loc(int_df, intersection):
    # look up the (lat, long) tuple for one intersection name
    cam = int_df[int_df['intersection']==intersection]
    #print(cam)
    return (float(cam['lat'].iloc[0]), float(cam['long'].iloc[0]))
        


# create a location column so we only have to do it once
rlc_df['location'] = rlc_df['intersection'].apply(lambda x: read_loc(int_df, x))
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year,location
0,IRVING PARK AND KILPATRICK,2763,4700 W IRVING PARK ROA,2015-04-09,4,41.953395,-87.744635,4,9,3,2015,"(41.95339512528642, -87.74463497408772)"
2,115TH AND HALSTED,2552,11500 S HALSTED STREE,2015-04-08,5,41.685089,-87.642094,4,8,2,2015,"(41.68508889165794, -87.64209428512216)"
3,IRVING PARK AND KILPATRICK,2764,4700 W IRVING PARK ROA,2015-04-19,4,41.953395,-87.744635,4,19,6,2015,"(41.95339512528642, -87.74463497408772)"
4,ELSTON AND IRVING PARK,1503,3700 W IRVING PARK ROA,2015-04-23,3,41.953778,-87.719161,4,23,3,2015,"(41.9537776271701, -87.71916140535721)"
5,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808442,-87.684183,6,5,2,2019,"(41.808442084381, -87.68418270817706)"
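
The row-by-row lookup is what makes this cell slow: it filters `int_df` once for every violation record. A single left join on the intersection name would attach the coordinates to all rows at once. A sketch on made-up miniatures of the two frames (`int_demo` and `rlc_demo` are hypothetical names):

```python
import pandas as pd

# Made-up miniatures of int_df and rlc_df.
int_demo = pd.DataFrame({
    "intersection": ["IRVING PARK AND KILPATRICK", "115TH AND HALSTED"],
    "lat": [41.953395, 41.685089],
    "long": [-87.744635, -87.642094],
})
rlc_demo = pd.DataFrame({
    "intersection": [
        "IRVING PARK AND KILPATRICK", "115TH AND HALSTED",
        "IRVING PARK AND KILPATRICK",
    ],
    "violations": [4, 5, 4],
})

# One left join attaches lat/long to every violation row at once;
# validate guards against duplicate intersection names on the right side.
merged = rlc_demo.merge(
    int_demo, on="intersection", how="left", validate="many_to_one"
)
merged = merged.rename(columns={"lat": "latitude", "long": "longitude"})
```

Any intersection missing from the right-hand frame would show up as NaN coordinates after the join, which makes gaps easy to spot with `isna()`.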


In [104]:
rlc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 572324 entries, 0 to 572558
Data columns (total 12 columns):
intersection      572324 non-null object
camera_id         572324 non-null object
address           572324 non-null object
violation_date    572324 non-null datetime64[ns]
violations        572324 non-null int64
latitude          572324 non-null float64
longitude         572324 non-null float64
month             572324 non-null int64
day               572324 non-null int64
weekday           572324 non-null int64
year              572324 non-null int64
location          572324 non-null object
dtypes: datetime64[ns](1), float64(2), int64(5), object(4)
memory usage: 76.8+ MB


In [105]:
rlc_df[rlc_df.latitude.isna()]['intersection'].unique()  # which intersections am I still missing

array([], dtype=object)

In [106]:
# then add in the new lat longs to the df
rlc_df['latitude'] = rlc_df['location'].apply(lambda x: x[0])
rlc_df['longitude'] = rlc_df['location'].apply(lambda x: x[1])

In [107]:
rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,day,weekday,year,location
0,IRVING PARK AND KILPATRICK,2763,4700 W IRVING PARK ROA,2015-04-09,4,41.953395,-87.744635,4,9,3,2015,"(41.95339512528642, -87.74463497408772)"
2,115TH AND HALSTED,2552,11500 S HALSTED STREE,2015-04-08,5,41.685089,-87.642094,4,8,2,2015,"(41.68508889165794, -87.64209428512216)"
3,IRVING PARK AND KILPATRICK,2764,4700 W IRVING PARK ROA,2015-04-19,4,41.953395,-87.744635,4,19,6,2015,"(41.95339512528642, -87.74463497408772)"
4,ELSTON AND IRVING PARK,1503,3700 W IRVING PARK ROA,2015-04-23,3,41.953778,-87.719161,4,23,3,2015,"(41.9537776271701, -87.71916140535721)"
5,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808442,-87.684183,6,5,2,2019,"(41.808442084381, -87.68418270817706)"


In [108]:
if 'location' in rlc_df.columns:
    rlc_df.drop(columns=['location'], inplace=True)

In [109]:
make_table(rlc_df, 'daily_violations', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('intersction_locations',), ('daily_violations',)]


## 3) intersection_locations TABLE - from intersection_df  (NO LONGER NEEDED)

## Add a df of intersections with lat long
This should help us determine later whether a crash occurred at an intersection.

I chose to group by intersection and take the most commonly occurring lat/long value for each one.

In [112]:
#results_df.groupby(['intersection', 'latitude', 'longitude']).reset_index()
intersection_df = rlc_df.groupby(['intersection']).agg({'latitude':pd.Series.mode,'longitude':pd.Series.mode,}).reset_index()

In [113]:
intersection_df.info()
intersection_df.head()  # that was easy!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183 entries, 0 to 182
Data columns (total 3 columns):
intersection    183 non-null object
latitude        183 non-null float64
longitude       183 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.4+ KB


Unnamed: 0,intersection,latitude,longitude
0,111TH AND HALSTED,41.692362,-87.642423
1,115TH AND HALSTED,41.685089,-87.642094
2,119TH AND HALSTED,41.677774,-87.64193
3,31ST AND CALIFORNIA,41.837424,-87.695022
4,31ST ST AND MARTIN LUTHER KING DRIVE,41.838441,-87.617338
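One caveat with `pd.Series.mode` is that it returns every tied value, so an intersection with a tie may end up with an array in its cell or raise, depending on the pandas version. Below is a minimal sketch of a tie-safe variant on toy data. The frame `toy` and the helper `first_mode` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: 'A AND B' has a tie between two latitude values.
toy = pd.DataFrame({
    'intersection': ['A AND B', 'A AND B', 'C AND D', 'C AND D', 'C AND D'],
    'latitude': [41.10, 41.20, 41.50, 41.50, 41.60],
    'longitude': [-87.10, -87.10, -87.50, -87.50, -87.60],
})

def first_mode(series):
    """Return one most-common value; mode() sorts ties, so the smallest wins."""
    return series.mode().iloc[0]

toy_locs = (toy.groupby('intersection')
               .agg({'latitude': first_mode, 'longitude': first_mode})
               .reset_index())
print(toy_locs)
```

Picking the smallest tied value is an arbitrary choice; any deterministic rule would work as long as each intersection ends up with a single lat/long pair.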


### Create intersection_locations TABLE - from intersection_df

In [114]:
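make_table(intersection_df, 'intersction_locations', c, conn)  # NOTE: the name is misspelled ('intersction'); left as-is because that is the name stored in the db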

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',)]


## 4) Create intersection_cams TABLE - from int_cams

### We will now focus on trying to bring rlc intersections to our crashes
We find that we have 363 cameras at 183 intersections

In [115]:
cam_locs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 363 entries, 0 to 363
Data columns (total 6 columns):
camera_id       363 non-null object
intersection    363 non-null object
address         363 non-null object
latitude        363 non-null float64
longitude       363 non-null float64
day             363 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 19.9+ KB


In [116]:
int_cams = cam_locs.groupby(['intersection']) \
                    .agg({'latitude':pd.Series.max, 'longitude':pd.Series.max,}) \
                    .reset_index()

int_cams['cam1'] = int_cams['intersection'] \
                            .apply(lambda x: cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[0])

int_cams['cam2'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])==1 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[1])

int_cams['cam3'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])<3 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[2])                             

int_cams.head()






Unnamed: 0,intersection,latitude,longitude,cam1,cam2,cam3
0,111TH AND HALSTED,41.692362,-87.642423,2422,2424,
1,115TH AND HALSTED,41.685089,-87.642094,2552,2553,
2,119TH AND HALSTED,41.677774,-87.64193,2402,2404,
3,31ST AND CALIFORNIA,41.837424,-87.695022,2061,2064,
4,31ST ST AND MARTIN LUTHER KING DRIVE,41.838441,-87.617338,2121,2123,


In [117]:
print('Total Cameras', len(cam_locs))
print('Total Intersections', len(int_cams))

Total Cameras 363
Total Intersections 183
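The three `apply` calls above re-filter `cam_locs` once per intersection and per camera slot. A possible alternative, sketched here on toy data, numbers the cameras within each intersection and pivots them into `cam1`, `cam2`, `cam3` columns. `toy_cams`, `slot` and `wide` are hypothetical names; only `camera_id` and `intersection` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for cam_locs, using the same column names.
toy_cams = pd.DataFrame({
    'camera_id': ['2422', '2424', '2552', '1001', '1002', '1003'],
    'intersection': ['111TH AND HALSTED', '111TH AND HALSTED', '115TH AND HALSTED',
                     'X AND Y', 'X AND Y', 'X AND Y'],
})

# Number the cameras within each intersection: 1, 2, 3, ...
toy_cams['slot'] = toy_cams.groupby('intersection').cumcount() + 1

# One row per intersection, one column per camera slot (missing slots become NaN).
wide = (toy_cams.pivot(index='intersection', columns='slot', values='camera_id')
                .rename(columns=lambda n: f'cam{n}')
                .reset_index())
wide.columns.name = None
print(wide)
```

Unlike the original cell, this leaves missing slots as NaN rather than None and would create a `cam4` column if any intersection had four cameras.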


### Create intersection_cams TABLE - from int_cams
The table is written without a region column for now.
`region_id` is added to each intersection later, once the congestion region data has been brought in.

In [118]:
make_table(int_cams, 'intersection_cams', c, conn)

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('all_crashes',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',)]


## 5) Create signal_crashes TABLE - from crash_df AND all_crashes - from crash_df (pre)

In [119]:
# Crash Data
crash_data = client.get("85ca-t3if",
                        where="crash_date BETWEEN '2015-01-01T00:00:00.000' AND '2020-12-31T00:00:00.000'",
                        limit=1000000,
                        )

crash_df = pd.DataFrame.from_records(crash_data) # Convert to pandas DataFrame

### Crash data preprocessing

In [120]:
# drop a few columns we don't need, including location (we have lat/long)
dropme = ['statements_taken_i', 'private_property_i', 'photos_taken_i', 'dooring_i', 'date_police_notified','location']

crash_df.drop(columns=dropme, inplace=True)

In [121]:
crash_df.isna().sum()

crash_record_id                       0
rd_no                                 0
crash_date                            0
posted_speed_limit                    0
traffic_control_device                0
device_condition                      0
weather_condition                     0
lighting_condition                    0
first_crash_type                      0
trafficway_type                       0
alignment                             0
roadway_surface_cond                  0
road_defect                           0
report_type                       11277
crash_type                            0
damage                                0
prim_contributory_cause               0
sec_contributory_cause                0
street_no                             0
street_direction                      3
street_name                           1
beat_of_occurrence                    5
num_units                             0
most_severe_injury                  943
injuries_total                      932


About 2.5k entries have no location, so we drop them.

In [122]:
crash_df.dropna(subset=['latitude',], inplace=True)  # get rid of na locations

### Let's look at what is in the data   

In [123]:
# What's in this data?
col_interest = ['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'hit_and_run_i', 'damage', 'prim_contributory_cause',
       'sec_contributory_cause', 'street_no', 'street_direction',
       'street_name', 'beat_of_occurrence', 'num_units', 'most_severe_injury', 
        'injuries_fatal', 'injuries_incapacitating',
       'injuries_non_incapacitating', 'injuries_reported_not_evident',
       'injuries_no_indication', 'injuries_unknown', 'crash_hour',
       'crash_day_of_week', 'crash_month', 'latitude', 'longitude', 'lane_cnt',
       'intersection_related_i', 'crash_date_est_i',
       'work_zone_i', 'work_zone_type',
       'workers_present_i']

for col in col_interest:
    print(col, crash_df[col].unique())

traffic_control_device ['NO CONTROLS' 'STOP SIGN/FLASHER' 'TRAFFIC SIGNAL' 'UNKNOWN'
 'OTHER REG. SIGN' 'LANE USE MARKING' 'POLICE/FLAGMAN'
 'RAILROAD CROSSING GATE' 'SCHOOL ZONE' 'DELINEATORS'
 'OTHER RAILROAD CROSSING' 'FLASHING CONTROL SIGNAL' 'NO PASSING'
 'RR CROSSING SIGN' 'BICYCLE CROSSING SIGN']
device_condition ['NO CONTROLS' 'FUNCTIONING PROPERLY' 'NOT FUNCTIONING' 'UNKNOWN' 'OTHER'
 'FUNCTIONING IMPROPERLY' 'WORN REFLECTIVE MATERIAL' 'MISSING']
weather_condition ['CLEAR' 'RAIN' 'UNKNOWN' 'CLOUDY/OVERCAST' 'SNOW' 'SLEET/HAIL'
 'FOG/SMOKE/HAZE' 'FREEZING RAIN/DRIZZLE' 'OTHER' 'BLOWING SNOW'
 'SEVERE CROSS WIND GATE' 'BLOWING SAND, SOIL, DIRT']
lighting_condition ['DAYLIGHT' 'DARKNESS' 'DARKNESS, LIGHTED ROAD' 'UNKNOWN' 'DAWN' 'DUSK']
first_crash_type ['TURNING' 'REAR END' 'PARKED MOTOR VEHICLE'
 'SIDESWIPE OPPOSITE DIRECTION' 'ANGLE' 'SIDESWIPE SAME DIRECTION'
 'OTHER OBJECT' 'PEDESTRIAN' 'FIXED OBJECT' 'PEDALCYCLIST' 'HEAD ON'
 'REAR TO FRONT' 'REAR TO SIDE' 'REAR TO REAR' 'A

### Filter for desired crashes (intersections with signal)
The value lists above show what we can filter on.  
We can filter 'traffic_control_device' == 'TRAFFIC SIGNAL'.  
We can filter 'intersection_related_i' == 'Y'.

This leaves only crashes that the officer marked as intersection-related and where the control device was a traffic signal.

intersection_related_i: A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.

In [124]:
make_table(crash_df, 'all_crashes', c, conn)

crash_df = crash_df[(crash_df['traffic_control_device']=='TRAFFIC SIGNAL') & \
                    (crash_df['intersection_related_i']=='Y')]

[('hourly_congestion',), ('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('signal_crashes',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',)]


### Add the intersection to my crashes


<class 'pandas.core.frame.DataFrame'>
Index: 183 entries, 111TH AND HALSTED to WESTERN AND TOUHY
Data columns (total 17 columns):
protected_turn    183 non-null int64
total_lanes       183 non-null int64
medians           183 non-null int64
exit              183 non-null int64
split             183 non-null int64
way               183 non-null int64
underpass         183 non-null int64
no_left           183 non-null int64
angled            183 non-null int64
triangle          183 non-null int64
one_way           183 non-null int64
turn_lanes        183 non-null int64
lat               183 non-null float64
long              183 non-null float64
rlc               183 non-null int64
intersection      183 non-null object
daily_traffic     183 non-null int64
dtypes: float64(2), int64(14), object(1)
memory usage: 25.7+ KB


In [279]:
#⏳⏳⏳⏳⏳
# Now I am desperate.  The earlier approach takes too long to process.  Let's simplify it and make it a box instead.
box_side = 80  # box is 80m on a side, so a crash must be within 40m (N-S and E-W) of the intersection
box_lat = box_side / 111070 / 2 # half side in degrees; 111070 is meters per deg lat in Chicago
box_long = box_side / 83000 / 2 # half side in degrees; 83000 is meters per deg long in Chicago

def box_check(lat, long, int_df):
    answer = (int_df[  (int_df['lat'] > (lat - box_lat)) & 
                      (int_df['lat'] < (lat + box_lat)) &
                      (int_df['long'] > (long - box_long)) &
                      (int_df['long'] < (long + box_long))
                     ])
    if answer.empty: return None
    return answer['intersection'].values[0]
    
# THIS SEEMS TO WORK WITH SPEED AND ELIMINATES MEMORY PROBLEM
crash_df['intersection'] = crash_df.apply(lambda x: box_check(float(x.latitude), 
                                                              float(x.longitude), 
                                                              int_df), axis=1)
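The row-wise `apply` above runs one filter on `int_df` per crash. The same box test can be evaluated for all crashes at once with NumPy broadcasting. This is a sketch on toy data rather than a drop-in replacement: `toy_ints`, `toy_crashes` and `box_match` are hypothetical names, and only the column names and constants come from the cell above.

```python
import numpy as np
import pandas as pd

BOX_SIDE_M = 80                       # full side of the box in meters
half_lat = BOX_SIDE_M / 111070 / 2    # same constants as the cell above
half_long = BOX_SIDE_M / 83000 / 2

# Hypothetical toy inputs using the column names of int_df and crash_df.
toy_ints = pd.DataFrame({'intersection': ['A AND B', 'C AND D'],
                         'lat': [41.9000, 41.8000],
                         'long': [-87.7000, -87.6000]})
toy_crashes = pd.DataFrame({'latitude': [41.9001, 41.8500, 41.8002],
                            'longitude': [-87.7001, -87.6500, -87.5999]})

def box_match(crashes, ints):
    """Label each crash with the first intersection inside its box, else None."""
    c_lat = crashes['latitude'].to_numpy(dtype=float)[:, None]
    c_long = crashes['longitude'].to_numpy(dtype=float)[:, None]
    i_lat = ints['lat'].to_numpy(dtype=float)[None, :]
    i_long = ints['long'].to_numpy(dtype=float)[None, :]
    # Boolean grid of shape (n_crashes, n_intersections)
    inside = (np.abs(c_lat - i_lat) < half_lat) & (np.abs(c_long - i_long) < half_long)
    first = inside.argmax(axis=1)     # first True per row (0 when there is none)
    names = ints['intersection'].to_numpy()[first].astype(object)
    names[~inside.any(axis=1)] = None  # rows with no match at all
    return names

labels = box_match(toy_crashes, toy_ints)
print(labels)
```

For roughly 60k crashes and 183 intersections the boolean grid holds about 11 million entries, which should fit in memory comfortably, but this has not been timed against the real data.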

In [283]:
# split crash_date into integer year/month/day/hour columns
crash_df['crash_date'] = pd.to_datetime(crash_df['crash_date'])
crash_df['year'] = crash_df['crash_date'].dt.year
crash_df['month'] = crash_df['crash_date'].dt.month
crash_df['day'] = crash_df['crash_date'].dt.day
crash_df['hour'] = crash_df['crash_date'].dt.hour

In [284]:
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60337 entries, 5 to 466380
Data columns (total 51 columns):
crash_record_id                  60337 non-null object
rd_no                            60337 non-null object
crash_date                       60337 non-null datetime64[ns]
posted_speed_limit               60337 non-null object
traffic_control_device           60337 non-null object
device_condition                 60337 non-null object
weather_condition                60337 non-null object
lighting_condition               60337 non-null object
first_crash_type                 60337 non-null object
trafficway_type                  60337 non-null object
alignment                        60337 non-null object
roadway_surface_cond             60337 non-null object
road_defect                      60337 non-null object
report_type                      58551 non-null object
crash_type                       60337 non-null object
damage                           60337 non-null object
pr

### Create signal_crashes TABLE - from crash_df

In [285]:
make_table(crash_df, 'signal_crashes', c, conn)

[('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',)]


## 6) Create hourly_congestion TABLE from all_traffic
For this one we have to combine two different datasets, because Chicago changed the way the data was recorded in 2018.  The columns are similar, but the newer dataset collects more data.

In [134]:
# Congestion Data
traffic_df = client.get("emtn-qqdi", 
                     #where="TIME > \'2015-01-01T00:00:00.000\'",
                     where='TIME BETWEEN \'2015-01-01T00:00:00.000\' AND \'2020-12-31T00:00:00.000\'',
                     limit=10000000,
                    )

traffic_df = pd.DataFrame.from_records(traffic_df) # Convert to pandas DataFrame

### Clean up my datatypes before preprocessing
We can't write the table until both datasets have been loaded.

In [135]:
traffic_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df['time'] = pd.to_datetime(traffic_df['time'])
traffic_df['bus_count'] = traffic_df['bus_count'].astype(int)
traffic_df['num_reads'] = traffic_df['num_reads'].astype(int)
traffic_df['speed'] = traffic_df['speed'].astype(float)

### On to the other dataset

In [136]:
# Congestion data from later
traffic_df2 = client.get("kf7e-cur8", #2018 to present
                     select='time, region_id, speed, bus_count, num_reads',  # this set is huge, so we won't get all       
                     where="TIME < \'2021-01-01T00:00:00.000\'",
                     limit=10000000,
                    )

# Convert to pandas DataFrame
traffic_df2 = pd.DataFrame.from_records(traffic_df2)


In [137]:
#traffic2_df.rename(columns={'number_of_reads':'num_reads'}, inplace=True)
traffic_df2['time'] = pd.to_datetime(traffic_df2['time'])
traffic_df2['bus_count'] = traffic_df2['bus_count'].astype(int)
traffic_df2['num_reads'] = traffic_df2['num_reads'].astype(int)
traffic_df2['speed'] = traffic_df2['speed'].astype(float)

## Now get the congestion data processed
We have two separate traffic DataFrames: the data before 2018 and the data from 2018 onward come from two different API endpoints.


In [138]:
traffic_df.head()
traffic_df2.head()
traffic_df2.info()
print()
traffic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3959185 entries, 0 to 3959184
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
speed        float64
bus_count    int64
num_reads    int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 151.0+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4168199 entries, 0 to 4168198
Data columns (total 5 columns):
time         datetime64[ns]
region_id    object
bus_count    int64
num_reads    int64
speed        float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 159.0+ MB


In [139]:
# Merge my two data sets for congestion by region
all_traffic = pd.merge(traffic_df, traffic_df2, how='outer')
print('traffic dfs merged')

traffic dfs merged
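Because both frames share the same five columns, the outer merge behaves much like stacking the rows while collapsing rows that appear in both frames. The sketch below compares it with `pd.concat` on toy data. The frames `old` and `new` are hypothetical, and the two approaches are not guaranteed to agree when a frame contains repeated rows of its own.

```python
import pandas as pd

# Hypothetical toy frames that overlap on one reading (00:10).
old = pd.DataFrame({'time': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 00:10']),
                    'region_id': ['1', '1'], 'speed': [20.0, 21.0]})
new = pd.DataFrame({'time': pd.to_datetime(['2018-01-01 00:10', '2018-01-01 00:20']),
                    'region_id': ['1', '1'], 'speed': [21.0, 22.0]})

stacked = pd.concat([old, new], ignore_index=True)          # keeps the overlap twice
deduped = stacked.drop_duplicates().reset_index(drop=True)  # overlap kept once
merged = pd.merge(old, new, how='outer')                    # joins on all shared columns

print(len(stacked), len(deduped), len(merged))
```

Since the next step averages readings per hour and region, whether the overlap is counted once or twice can shift those averages slightly, so the choice is worth making deliberately.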


In [140]:
all_traffic['hour'] = all_traffic['time'].dt.hour
print('added hour column')

all_traffic['day'] = all_traffic.time.dt.day
print('added day column')

all_traffic['month'] = all_traffic.time.dt.month
print('added month column')

all_traffic['year'] = all_traffic.time.dt.year
print('added year column')

all_traffic['weekday'] = all_traffic.time.dt.weekday
print('added weekday column')

added hour column
added day column
added month column
added year column
added weekday column


In [141]:
print(len(all_traffic))  # lots of dupes 
all_traffic = all_traffic.groupby(['year', 'month', 'day', 'hour', 'region_id']).mean().reset_index()
all_traffic.info()

7917308
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1331680 entries, 0 to 1331679
Data columns (total 9 columns):
year         1331680 non-null int64
month        1331680 non-null int64
day          1331680 non-null int64
hour         1331680 non-null int64
region_id    1331680 non-null object
bus_count    1331680 non-null float64
num_reads    1331680 non-null float64
speed        1331680 non-null float64
weekday      1331680 non-null float64
dtypes: float64(4), int64(4), object(1)
memory usage: 91.4+ MB


In [142]:
all_traffic.head()

Unnamed: 0,year,month,day,hour,region_id,bus_count,num_reads,speed,weekday
0,2015,1,1,0,1,7.333333,90.166667,27.455,3.0
1,2015,1,1,0,10,35.666667,386.666667,25.796667,3.0
2,2015,1,1,0,11,15.833333,231.666667,25.816667,3.0
3,2015,1,1,0,12,19.166667,259.5,18.636667,3.0
4,2015,1,1,0,13,26.0,367.833333,20.681667,3.0


In [143]:
# couple minutes
make_table(all_traffic, 'hourly_congestion', c, conn)

[('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


# Speed fix for congestion
Congestion is measured by average bus speed.

The problem:
- Overnight (between 11pm and 5am) we have few bus routes.
- Some regions have no buses overnight
- Some regions have only a few buses 
- Some buses are ending routes and have only a few reads
- Some buses are stationary (next morning staging)

The fix:
- replace the speed when there are few buses/reads and the speed is low
- we assume few buses/reads means overnight, when congestion is minimal
- the replacement speed is the region's low-congestion quantile speed (the 0.90 quantile)

In [144]:
## Let's get the 0.90 quantile for every region, and then use that to fill in missing data

regions_90 = all_traffic.groupby(['region_id'])['speed'].quantile(0.9).reset_index()


In [145]:
regions_90.head()


Unnamed: 0,region_id,speed
0,1,25.058333
1,10,26.385
2,11,26.8034
3,12,23.07
4,13,24.091667


In [146]:
#### 5 MINUTES OR SO

# My read on this is that few buses run 24/7, so the overnight data is unreliable.
# Buses stage for the next morning.  You can see them all along Clark, LSD etc.
# They have speed=0 and may be recording.  Could talk to the owner of the dataset.

# Cutoff: 5 or fewer buses, or fewer than 100 reads, combined with speed < 25.
# Any speed > 40 is also replaced, regardless of the bus/read counts.
# In those cases I put in the 0.90 quantile speed for the region.


def speed_check(bus, speed, reads, region_id, regions_90):
    # parentheses make the original operator precedence explicit
    if ((bus <= 5 or reads < 100) and speed < 25) or speed > 40:
        return regions_90[regions_90['region_id']==region_id]['speed'].values[0]
    else:
        return speed
    

# apply is SLOOOOOOWWW, but not sure how else to accomplish this without iter
all_traffic['speed'] = all_traffic.apply(lambda x: speed_check(x.bus_count, x.speed, x.num_reads, x.region_id, regions_90), axis=1)
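One way to avoid the slow row-wise `apply` is to look up each row's regional quantile with `map` and apply the replacement rule with `np.where`. The sketch below uses toy data and assumes the rule is exactly the one coded in `speed_check`. `toy_traffic`, `toy_q90`, `sparse` and `replace` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with the same columns as all_traffic and regions_90.
toy_traffic = pd.DataFrame({
    'region_id': ['1', '1', '2', '2'],
    'bus_count': [3.0, 30.0, 20.0, 2.0],
    'num_reads': [40.0, 400.0, 300.0, 50.0],
    'speed':     [5.0, 22.0, 45.0, 30.0],
})
toy_q90 = pd.DataFrame({'region_id': ['1', '2'], 'speed': [25.0, 26.0]})

# Look up each row's regional 0.90 quantile speed.
q90 = toy_traffic['region_id'].map(toy_q90.set_index('region_id')['speed'])

# Same condition as speed_check, with the precedence written out.
sparse = (toy_traffic['bus_count'] <= 5) | (toy_traffic['num_reads'] < 100)
replace = (sparse & (toy_traffic['speed'] < 25)) | (toy_traffic['speed'] > 40)

toy_traffic['speed'] = np.where(replace, q90, toy_traffic['speed'])
print(toy_traffic)
```

Here the first row (few buses, slow) and the third row (speed above 40) are replaced, while the other two keep their measured speed.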
      


In [147]:
make_table(all_traffic, 'hourly_congestion', c, conn)

[('hourly_weather',), ('region_data',), ('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',)]


## 7) Create hourly_weather from wx_df

In [148]:
# Import weather data from csv
wx_df = pd.read_csv('data/chi_wx.csv')

In [149]:
wx_df.head()

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,...,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1420070400,2015-01-01 00:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,265.96,258.16,264.85,267.708,...,230,,,,,1,800,Clear,sky is clear,01n
1,1420074000,2015-01-01 01:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.13,256.52,265.35,267.926,...,230,,,,,20,801,Clouds,few clouds,02n
2,1420077600,2015-01-01 02:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.17,257.7,265.35,268.098,...,230,,,,,20,801,Clouds,few clouds,02n
3,1420081200,2015-01-01 03:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.39,257.56,265.35,268.157,...,240,,,,,1,800,Clear,sky is clear,01n
4,1420084800,2015-01-01 04:00:00 +0000 UTC,-21600,Chicago IL. USA,41.878114,-87.629798,266.47,256.5,265.35,268.121,...,240,,,,,1,800,Clear,sky is clear,01n


In [150]:
wx_df['time'] = pd.to_datetime(wx_df['dt_iso'].apply(lambda x: x[:-4]))
wx_df.head()
wx_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55219 entries, 0 to 55218
Data columns (total 26 columns):
dt                     55219 non-null int64
dt_iso                 55219 non-null object
timezone               55219 non-null int64
city_name              55219 non-null object
lat                    55219 non-null float64
lon                    55219 non-null float64
temp                   55219 non-null float64
feels_like             55219 non-null float64
temp_min               55219 non-null float64
temp_max               55219 non-null float64
pressure               55219 non-null int64
sea_level              0 non-null float64
grnd_level             0 non-null float64
humidity               55219 non-null int64
wind_speed             55219 non-null float64
wind_deg               55219 non-null int64
rain_1h                6587 non-null float64
rain_3h                816 non-null float64
snow_1h                1538 non-null float64
snow_3h                91 non-null float6

In [151]:
wx_df['rain_3h'] = wx_df['rain_3h'].fillna(0)
wx_df['rain_1h'] = wx_df['rain_1h'].fillna(0)
wx_df['snow_3h'] = wx_df['snow_3h'].fillna(0)
wx_df['snow_1h'] = wx_df['snow_1h'].fillna(0)
wx_df['temp'] = wx_df['temp_max']  # use the hourly max as the temperature (overwrites the original temp)
wx_df['year'] = wx_df.time.dt.year
wx_df['month'] = wx_df.time.dt.month
wx_df['day'] = wx_df.time.dt.day
wx_df['hour'] = wx_df.time.dt.hour
wx_df['weekday'] = wx_df.time.dt.weekday

In [152]:
wx_df.describe()

Unnamed: 0,dt,timezone,lat,lon,temp,feels_like,temp_min,temp_max,pressure,sea_level,...,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,year,month,day,hour,weekday
count,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,0.0,...,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0,55219.0
mean,1513951000.0,-19252.525399,41.87811,-87.6298,285.518514,280.257647,281.898104,285.518514,1016.055542,,...,0.047967,0.013533,0.001809,61.031294,750.721907,2017.485087,6.395969,15.782557,11.455894,2.998443
std,53882850.0,1714.737534,7.105492e-15,4.263295e-14,11.064564,13.182997,10.354815,11.064564,7.584028,,...,0.708266,0.123137,0.066962,32.04714,112.093082,1.695629,3.444176,8.817962,6.908119,1.999746
min,1420070000.0,-21600.0,41.87811,-87.6298,245.37,233.18,242.15,245.37,965.0,,...,0.0,0.0,0.0,0.0,200.0,2015.0,1.0,1.0,0.0,0.0
25%,1467211000.0,-21600.0,41.87811,-87.6298,276.48,269.89,274.15,276.48,1011.0,,...,0.0,0.0,0.0,40.0,800.0,2016.0,3.0,8.0,5.0,1.0
50%,1514524000.0,-18000.0,41.87811,-87.6298,285.193,279.25,281.161,285.193,1016.0,,...,0.0,0.0,0.0,75.0,802.0,2017.0,6.0,16.0,11.0,3.0
75%,1560530000.0,-18000.0,41.87811,-87.6298,295.15,291.9,290.95,295.15,1021.0,,...,0.0,0.0,0.0,90.0,803.0,2019.0,9.0,23.0,17.0,5.0
max,1606864000.0,-18000.0,41.87811,-87.6298,311.48,309.89,306.132,311.48,1044.0,,...,35.0,8.4,6.0,100.0,804.0,2020.0,12.0,31.0,23.0,6.0


In [153]:
try:
    wx_df = wx_df.drop(columns=['dt', 
                        'dt_iso', 
                        'timezone', 
                        'city_name', 
                        'lat', 
                        'lon', 
                        'feels_like', 
                        'temp_min', 
                        'temp_max',
                        'pressure',
                        'sea_level',
                        'grnd_level',
                        'humidity',
                        'wind_speed',
                        'wind_deg',
                        'clouds_all',
                        'weather_description',
                        'weather_icon',
                        'weather_id',
                        'weather_main',
                       ])
except KeyError:  # raised when the columns are already gone (e.g. the cell was re-run)
    print('Failed')

In [154]:
print(len(wx_df))
print(wx_df.duplicated().sum())


print('Total hours in 6 years:', 365.25 * 24 * 6)
print('Unique entries:', len(wx_df.drop_duplicates()))  
# missing a few entries (700+ out of 52k)  Am I missing a month??

print()
print(wx_df.time.min(), wx_df.time.max())  # OH!!!!  I am missing the last month
print('Total hours in 6 years (-1 mos):', 365.25 * 24 * 6 - 31 * 24)  # okay, we are only missing a few


wx_df.drop_duplicates(inplace=True)

55219
3331
Total hours in 6 years: 52596.0
Unique entries: 51888

2015-01-01 00:00:00+00:00 2020-12-01 23:00:00+00:00
Total hours in 6 years (-1 mos): 51852.0
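The `365.25 * 24 * 6` estimate is approximate. A sketch of an exact count follows, assuming one record per hour between the first and last timestamps printed above. `start`, `end` and `expected_hours` are names introduced here for illustration.

```python
from datetime import datetime

start = datetime(2015, 1, 1, 0)    # first timestamp printed above
end = datetime(2020, 12, 1, 23)    # last timestamp printed above

# Hourly records include both endpoints, hence the +1.
expected_hours = int((end - start).total_seconds() // 3600) + 1
print(expected_hours)
```

This gives 51888, the same as the unique-row count printed above, which suggests the hourly coverage is complete for that range. Duplicate timestamps with differing values would still need a separate check.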


In [155]:
make_table(wx_df, 'hourly_weather', c, conn)

[('region_data',), ('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',)]


## 8) Create TABLE region_data from region_df

In [156]:
# This time we only grab what we need

region_df = client.get("kf7e-cur8", # regional congestion current data
                         select='region_id, region, description, north, south, east, west',
                         limit=1000
                    )

# Convert to pandas DataFrame
region_df = pd.DataFrame.from_records(region_df)  # should only return most recent for each region

In [157]:
region_df = region_df.groupby('region_id').max().reset_index()

In [158]:
# need these as floats so we can compare them
region_df[['north', 'south', 'east', 'west']] = region_df[['north', 'south', 'east', 'west']].astype(float)
region_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 7 columns):
region_id      29 non-null object
region         29 non-null object
description    29 non-null object
north          29 non-null float64
south          29 non-null float64
east           29 non-null float64
west           29 non-null float64
dtypes: float64(4), object(3)
memory usage: 1.7+ KB


### Add region to my crash df

In [159]:
crash_df[['latitude', 'longitude']] = crash_df[['latitude', 'longitude']].astype(float)
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60337 entries, 5 to 466380
Data columns (total 48 columns):
crash_record_id                  60337 non-null object
rd_no                            60337 non-null object
crash_date                       60337 non-null datetime64[ns]
posted_speed_limit               60337 non-null object
traffic_control_device           60337 non-null object
device_condition                 60337 non-null object
weather_condition                60337 non-null object
lighting_condition               60337 non-null object
first_crash_type                 60337 non-null object
trafficway_type                  60337 non-null object
alignment                        60337 non-null object
roadway_surface_cond             60337 non-null object
road_defect                      60337 non-null object
report_type                      58551 non-null object
crash_type                       60337 non-null object
damage                           60337 non-null object
pr

In [160]:
# add in the region for my crashes
# Resource hog


def which_region(lat, long, region_df):
    """Return the region_id whose bounding box contains (lat, long)."""
    #print(lat, long)
    row = region_df[(region_df['east'] >= long) &
                    (region_df['west'] < long) &
                    (region_df['north'] >= lat) &
                    (region_df['south'] < lat)]['region_id'].max()
    return row

# takes about 5 min
crash_df['region_id'] = crash_df.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)
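To make the behavior of `which_region` concrete, here is a small sketch on toy bounding boxes. The frame `toy_regions` and its coordinates are hypothetical. The function body repeats the logic of the cell above so the block runs on its own.

```python
import pandas as pd

# Hypothetical regions: two stacked boxes sharing the same east/west bounds.
toy_regions = pd.DataFrame({
    'region_id': ['1', '2'],
    'north': [42.0, 41.8], 'south': [41.8, 41.6],
    'east': [-87.6, -87.6], 'west': [-87.8, -87.8],
})

def which_region(lat, long, region_df):
    """Same logic as the cell above: the region whose box contains the point."""
    hits = region_df[(region_df['east'] >= long) & (region_df['west'] < long) &
                     (region_df['north'] >= lat) & (region_df['south'] < lat)]
    return hits['region_id'].max()

inside = which_region(41.9, -87.7, toy_regions)    # falls in region '1'
outside = which_region(43.0, -87.7, toy_regions)   # no box contains this point
print(inside, outside)
```

Points outside every box come back as NaN, so downstream joins on `region_id` should expect missing values. Also, because `region_id` is stored as text, `.max()` would compare ids as strings if two boxes ever overlapped.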

In [161]:
len(crash_df)
crash_df.columns

crash_df['time'] = pd.to_datetime(crash_df.crash_date)
crash_df['year'] = crash_df.time.dt.year
crash_df['month'] = crash_df.time.dt.month
crash_df['day'] = crash_df.time.dt.day
crash_df['hour'] = crash_df.time.dt.hour
crash_df['weekday'] = crash_df.time.dt.weekday

In [162]:
make_table(region_df, 'region_data', c, conn)
print()
make_table(crash_df, 'signal_crashes', c, conn)  # also update my crash data

[('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('signal_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',)]

[('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('intersection_cams',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',)]


### Add region_id to intersection_cams
While I'm here and have the function ready, I add a region_id to each red light camera intersection (intersection_cams TABLE).
The region lets me link the daily_violations and hourly_congestion TABLEs through the intersection.

*** NOTE: It makes more sense to put the region into the intersection_cams table (183 rows) than into daily_violations (572k rows), which speeds this up.

In [163]:
# 10 minutes
#rlc['region_id'] = 
int_cams['region_id'] = int_cams.apply(lambda x: which_region(x.latitude, x.longitude, region_df), axis=1)


In [164]:
## commit my change
make_table(int_cams, 'intersection_cams', c, conn)

[('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('signal_crashes',), ('intersection_cams',)]


### Use this code to test any of your tables for proper data storage

In [165]:
query = c.execute("SELECT camera_id, violations FROM daily_violations;").fetchall()
print(query[:5])
print(len(query))

[('2763', 4), ('2552', 5), ('2764', 4), ('1503', 3), ('2141', 3)]
572324
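A similar check can be run against any table with the standard `sqlite3` module alone. The sketch below uses an in-memory database so it does not touch `database/rlc2.db`. The names `demo_conn` and `demo_c` and the three inserted rows are hypothetical stand-ins for the real connection and data.

```python
import sqlite3

# In-memory database with a tiny stand-in for daily_violations.
demo_conn = sqlite3.connect(':memory:')
demo_c = demo_conn.cursor()
demo_c.execute("CREATE TABLE daily_violations (camera_id TEXT, violations INTEGER)")
demo_c.executemany("INSERT INTO daily_violations VALUES (?, ?)",
                   [('2763', 4), ('2552', 5), ('2764', 4)])
demo_conn.commit()

# List the tables SQLite knows about, then count the rows in one of them.
tables = [name for (name,) in demo_c.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
row_count = demo_c.execute("SELECT COUNT(*) FROM daily_violations").fetchone()[0]

print(tables, row_count)
demo_conn.close()
```

Using `COUNT(*)` avoids pulling every row into Python just to measure the table, which matters for tables the size of `daily_violations`.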


## Before I go, I want to add intersections to my crashes to link the db tables

In [166]:
sql_fetch_tables(c, conn)

[('int_chars',),
 ('intersection_counts',),
 ('intersection_chars',),
 ('cam_locations',),
 ('cam_startend',),
 ('daily_violations',),
 ('intersction_locations',),
 ('all_crashes',),
 ('hourly_congestion',),
 ('hourly_weather',),
 ('region_data',),
 ('signal_crashes',),
 ('intersection_cams',)]

In [167]:
df = pd.read_sql_query("SELECT * FROM signal_crashes", conn)
camloc_df = pd.read_sql_query('SELECT * FROM cam_locations', conn)
ints_df = pd.read_sql_query('SELECT * FROM intersection_cams', conn)


In [249]:
#ints_df.astype({'longitude':float})
pd.options.display.max_rows = 200


60337

In [272]:
# Now I am desperate.  This takes too long to process.  Let's simplify it and use a bounding box instead.
box_side = 70  # box is 70 m on a side, so it checks for a crash within 35 m (N/S and E/W) of the intersection
box_lat = box_side / 111070 / 2  # half-side in degrees; about 111070 meters per degree of latitude in Chicago
box_long = box_side / 83000 / 2  # half-side in degrees; about 83000 meters per degree of longitude in Chicago

def box_check(lat, long, int_df):
    """Return the first intersection whose lat/long falls inside the box centered on (lat, long), else None."""
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    answer = int_df[(int_df['lat'] > s) &
                    (int_df['lat'] < n) &
                    (int_df['long'] > w) &
                    (int_df['long'] < e)]
    if answer.empty:
        return None
    return answer['intersection'].values[0]

    
# THIS SEEMS TO WORK AT SPEED AND ELIMINATES MEMORY PROBLEM
# Spot-check the first 50 crashes with the same box logic as box_check
for i in range(50):
    lat = float(df.iloc[i]['latitude'])
    long = float(df.iloc[i]['longitude'])
    n = lat + box_lat
    s = lat - box_lat
    e = long + box_long
    w = long - box_long
    answer = int_df[(int_df['lat'] > s) &
                    (int_df['lat'] < n) &
                    (int_df['long'] > w) &
                    (int_df['long'] < e)]['intersection'].values
    print(answer)

# 99th and Halsted: 41.714230, -87.643043
# MOMENT OF TRUTH: label every crash with its intersection (None if no match)
df['intersection'] = df.apply(lambda x: box_check(float(x.latitude), float(x.longitude), int_df), axis=1)



[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
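
For reference, the box test can also be done without a per-row apply. The sketch below is one possible vectorized version under stated assumptions: the intersections frame has `lat`, `long` and `intersection` columns as in box_check, and the crashes frame has `latitude` and `longitude`. The function name `box_match`, the `chunk` parameter, the second intersection name and the demo crash coordinates are hypothetical. It also checks the 83000 meters-per-degree figure with the usual cosine approximation, assuming a latitude of about 41.88 for Chicago.

```python
import math
import numpy as np
import pandas as pd

# Sanity check on the longitude constant: one degree of longitude shrinks with cos(latitude).
CHICAGO_LAT = 41.88  # assumed representative latitude
m_per_deg_long = 111_320 * math.cos(math.radians(CHICAGO_LAT))  # close to the 83000 used above

def box_match(crashes, ints, box_side=70, chunk=10_000):
    """Label each crash with the first intersection inside its box, else None."""
    half_lat = box_side / 111_070 / 2
    half_long = box_side / 83_000 / 2
    int_lat = ints["lat"].astype(float).to_numpy()
    int_long = ints["long"].astype(float).to_numpy()
    names = ints["intersection"].to_numpy(dtype=object)
    lat = crashes["latitude"].astype(float).to_numpy()
    lon = crashes["longitude"].astype(float).to_numpy()
    out = np.full(len(crashes), None, dtype=object)
    # Work in chunks so the (crashes x intersections) boolean matrix stays small in memory.
    for start in range(0, len(crashes), chunk):
        sl = slice(start, start + chunk)
        hit = ((np.abs(lat[sl, None] - int_lat) < half_lat)
               & (np.abs(lon[sl, None] - int_long) < half_long))
        has = hit.any(axis=1)
        first = hit.argmax(axis=1)  # first matching intersection per crash
        block = out[sl]             # a view, so the assignment writes into out
        block[has] = names[first[has]]
    return pd.Series(out, index=crashes.index)

# Toy data: the 99th and Halsted coordinates come from the comment above, the rest is invented.
ints_demo = pd.DataFrame({
    "intersection": ["99TH AND HALSTED", "HYPOTHETICAL AND EXAMPLE"],
    "lat": [41.714230, 41.800000],
    "long": [-87.643043, -87.600000],
})
crashes_demo = pd.DataFrame({
    "latitude": ["41.7143", "41.9000"],
    "longitude": ["-87.6431", "-87.7000"],
})
matched_demo = box_match(crashes_demo, ints_demo)
print(matched_demo.tolist())
```

The strict absolute-difference test is the same condition as the four inequalities in box_check, and taking the first match keeps the same tie-breaking order.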


In [273]:
df.intersection.count() / len(df)


0.13507466397069792

In [274]:
make_table(df, 'signal_crashes', c, conn)


[('int_chars',), ('intersection_counts',), ('intersection_chars',), ('cam_locations',), ('cam_startend',), ('daily_violations',), ('intersction_locations',), ('all_crashes',), ('hourly_congestion',), ('hourly_weather',), ('region_data',), ('intersection_cams',), ('signal_crashes',)]


## traffic_count TABLE from traffic_count
Congestion did not help the model. It is reported by region, and the regions added very little information.
We have data from a traffic study that gives the average traffic volume by street segment.
We will try to match the segment(s) to the cameras. This might be tricky.