# Building a SQL database
This notebook builds the necessary db files for the project using SQLite3.

Most data is taken from Chicago Data Portal https://data.cityofchicago.org/ using Socrata library.
The API endpoints for the data are:
- Red Light Violations: https://data.cityofchicago.org/resource/spqx-js37.json
- Congestion by Region 2018-Present: https://data.cityofchicago.org/resource/kf7e-cur8.json
- Congestion by Region 2013-2018: https://data.cityofchicago.org/resource/emtn-qqdi.json
- Traffic Crashes: https://data.cityofchicago.org/resource/85ca-t3if.json

Weather data is taken from https://openweathermap.org/weather-data and is saved as csv in data folder

## Required Imports

In [34]:
import pandas as pd
from sodapy import Socrata
#import matplotlib.pyplot as plt
from datetime import datetime
from modules.myfuncs import *
import warnings
import numpy as np
from geopy.geocoders import Nominatim


warnings.filterwarnings('ignore')

## Get the data from portal using Socrata client


In [None]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:

url = "data.cityofchicago.org"
client = Socrata(url, None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata(data.cityofchicago.org,
#                  MyAppToken,
#                  userame="user@example.com",
#                  password="AFakePassword")

## Use a Socrata client query to get all data
- rlc_cam is up to 1M redlight cams from 2015 to 2020
- crash_data is up to 1M crashes from 2015 to 2020
- traffic_data is up to 10M from 2015 to 2020

Weather data is taken from csv in data folder

In [3]:
# Red light violations
# Takes several minutes to run and holds about 500mb in memory to build

# First 1000000 results, returned as JSON from API / converted to Python list of dictionaries by sodapy
rlc_cam = client.get("spqx-js37", #speed cams are at 'hhkd-xvj4' if you want to investigate?
                     #where='violation_date > 01-01-2020',
                     where='violation_date BETWEEN \'2015-01-01T00:00:00.000\' AND \'2020-12-31T00:00:00.000\'',
                     limit=1000000,
                    )

rlc_df = pd.DataFrame.from_records(rlc_cam) # Convert to pandas DataFrame




# Crash Data
crash_data = client.get("85ca-t3if", 
                     where="crash_date BETWEEN \'2015-01-01T00:00:00.000\' AND \'2020-12-31T00:00:00.000\'",
                     limit=1000000,
                    )

crash_df = pd.DataFrame.from_records(crash_data) # Convert to pandas DataFrame




# Congestion Data
traffic_data = client.get("emtn-qqdi", 
                     #where="TIME > \'2015-01-01T00:00:00.000\'",
                     where='TIME BETWEEN \'2015-01-01T00:00:00.000\' AND \'2020-12-31T00:00:00.000\'',
                     limit=10000000,
                    )

traffic_df = pd.DataFrame.from_records(traffic_data) # Convert to pandas DataFrame

# Import weather data from csv
wx_df = pd.read_csv('data/chi_wx.csv')

## Preprocess Red Light Camera Data

Data Columns of interest:

INTERSECTION -
Intersection of the location of the red light enforcement camera(s). There may be more than one camera at each intersection. Plain Text

CAMERA ID -
A unique ID for each physical camera at an intersection, which may contain more than one camera. Plain Text

ADDRESS	-
The address of the physical camera (CAMERA ID). The address may be the same for all cameras or different, based on the physical installation of each camera. Plain Text

VIOLATION DATE -
The date of when the violations occurred. NOTE: The citation may be issued on a different date. Date & Time

VIOLATIONS - 
Number of violations for each camera on a particular day. Number

LATITUDE -
The latitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84. Number

LONGITUDE -
The longitude of the physical location of the camera(s) based on the ADDRESS column. Geocoded using the WGS84.
Number

## Investigate rlc_df

In [15]:
rlc_df.info()
rlc_df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569593 entries, 0 to 569827
Data columns (total 10 columns):
intersection      569593 non-null object
camera_id         569593 non-null object
address           569593 non-null object
violation_date    569593 non-null datetime64[ns]
violations        569593 non-null int64
latitude          539918 non-null float64
longitude         539918 non-null float64
month             569593 non-null int64
weekday           569593 non-null int64
year              569593 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(4), object(3)
memory usage: 47.8+ MB


intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          29675
longitude         29675
month                 0
weekday               0
year                  0
dtype: int64

We see that we have all text/non-null objects.  Need to convert first before manipulating for preprocess.

There are a fair number of missing locations/lat/long.  Hope to be able to replace those missing values.
This represents a large enough portion of dataset that we should look them up.

The na values for camera_id will have to be dropped, since we don't know what they are.

We will not be using x andy y_coordinate, so we drop those.  We will also drop location.  We already have lat long in other columns.

In [17]:
#client_df.dropna(subset=['camera_id']).isna().sum()
try:
    # put this is a try in case we run it twice, it will skip it.
    rlc_df.dropna(subset=['camera_id'], inplace=True)
    
    # drop xy coord and location columns
    rlc_df = rlc_df.drop(columns=['x_coordinate', 'y_coordinate', 'location'], index=1)
except:
    pass



rlc_df.isna().sum()

intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          29675
longitude         29675
month                 0
weekday               0
year                  0
dtype: int64

Manipulate datatypes for preprocessing. 

In [18]:
rlc_df['violations'] = rlc_df['violations'].apply(int)
rlc_df['latitude'] = rlc_df['latitude'].apply(float)
rlc_df['longitude'] = rlc_df['longitude'].apply(float)
rlc_df['violation_date'] = pd.to_datetime(rlc_df['violation_date'])
rlc_df['month'] = rlc_df['violation_date'].apply(lambda x: int(x.month))
rlc_df['weekday'] = rlc_df['violation_date'].apply(lambda x: int(datetime.weekday(x)))
rlc_df['year'] = rlc_df['violation_date'].apply(lambda x: int(x.year))

rlc_df.head()

Unnamed: 0,intersection,camera_id,address,violation_date,violations,latitude,longitude,month,weekday,year
0,IRVING PARK AND KILPATRICK,2763,4700 W IRVING PARK ROA,2015-04-09,4,,,4,3,2015
2,115TH AND HALSTED,2552,11500 S HALSTED STREE,2015-04-08,5,,,4,2,2015
3,IRVING PARK AND KILPATRICK,2764,4700 W IRVING PARK ROA,2015-04-19,4,,,4,6,2015
4,ELSTON AND IRVING PARK,1503,3700 W IRVING PARK ROA,2015-04-23,3,,,4,3,2015
5,4700 WESTERN,2141,4700 S WESTERN AVENUE,2019-06-05,3,41.808378,-87.684571,6,2,2019


## Make a df with info for each camera
Will contain the following:
- camera_id
- location
- start date (when was the camera turned on)
- end date (when was the camera turned off)

In [19]:
cam_df = rlc_df.copy()
cam_df['start'] = cam_df['camera_id'].apply(lambda x: None)
cam_df['end'] = cam_df['camera_id'].apply(lambda x: None)

In [24]:
cam_start = cam_df.groupby(['camera_id'])['violation_date'].min().reset_index()
cam_end = cam_df.groupby(['camera_id'])['violation_date'].max().reset_index()

cam_startend = cam_start.copy()

#print(cam_end[cam_end['camera_id']=='1503'].values[0][1])  # for testing output
cam_startend['end'] = cam_start['camera_id'].apply(lambda x: cam_end[cam_end['camera_id']==x].values[0][1])

cam_startend.rename(columns={"violation_date": "start"}, inplace=True)
                                                   
print('NA values in cam_startend:', cam_startend.isna().sum(), end='\n\n', sep='\n')

print('Describe cam_startend:', cam_startend.describe(), end='\n\n', sep='\n')

cam_startend.head()

NA values in cam_startend
camera_id    0
start        0
end          0
dtype: int64

Describe cam_startend
       camera_id                start                  end
count        363                  363                  363
unique       363                   18                   18
top         3032  2015-01-01 00:00:00  2020-12-20 00:00:00
freq           1                  285                  256
first        NaN  2015-01-01 00:00:00  2015-03-02 00:00:00
last         NaN  2018-03-05 00:00:00  2020-12-20 00:00:00



Unnamed: 0,camera_id,start,end
0,1002,2015-01-01,2020-12-19
1,1003,2015-01-01,2020-12-19
2,1011,2015-01-02,2020-12-20
3,1014,2015-01-01,2020-12-20
4,1023,2015-01-02,2020-12-20


## Make a db table that has camera locations and intersections
Intersections are present (and addresses), but we do not have lat/long info for all cams

In [32]:
rlc_df.groupby('camera_id')['latitude'].max()  # Some cams do not have any data for lat long at all


# Some of the addresses are truncated and not able to lookup with geocode

address_fix = {'2400 W VAN BUREN STREE': '2400 W VAN BUREN STREET',
               '4700 W IRVING PARK ROA': '4700 W IRVING PARK ROAD',
               '11500 S HALSTED STREE': '11500 S HALSTED STREET',
               '5500 S WENTWORTH AVEN': '5500 S WENTWORTH AVENUE',
                '10300 S HALSTED STREE': '10300 S HALSTED STREET',
               '3700 W IRVING PARK ROA': '3700 W IRVING PARK ROAD',
               '1600 W IRVING PARK ROA': '1600 W IRVING PARK ROAD',
               '7900 S JEFFERY BOULEV': '7900 S JEFFERY BOULEVARD',
               '2800 W IRVING PARK ROA': '2800 W IRVING PARK ROAD',
               '5200 W IRVING PARK ROA': '5200 W IRVING PARK ROAD',
               '3100 S DR MARTIN L KING': '3100 S MARTIN KING DRIVE',
               '1600 W DIVERSEY PARKWA': '1600 W DIVERSEY PARKWAY',
               '140 W KINZIE': '140 W Kinzie St',
                '150 N SACRAMENTO BOUL': '150 N SACRAMENTO BOUL',
               '800 N SACRAMENTO AVEN':'800 N SACRAMENTO AVENUE',
               '3200 N LAKESHORE DRIV':'3200 N LAKE SHORE DRIVE',
               '6400 W FULLERTON AVENU':'6400 W FULLERTON AVENUE',
               '6400 N MILWAUKEE AVEN':'6400 N MILWAUKEE AVENUE',
               '7900 S STONEY ISLAND':'7900 S Stony Island Ave',  
               '150 N SACRAMENTO BOUL':'150 N SACRAMENTO BOULEVARD',
                '3200 N LAKESHORE DRIVE':'3200 N Lake Shore Dr',
               '7900 S STONEY ISLAND AVENUE':'7900 S Stony Island Ave',
               '5600 W FULLERTON AVENU':'5600 W FULLERTON AVENUE',
               '8700 S LAFAYETTE AVEN':'8700 S LAFAYETTE AVENUE',
               '4400 N MILWAUKEE AVEN':'4400 N MILWAUKEE AVENUE',
              }

In [44]:
# we had some incorrect data in the code below, but have a creative fix.

cam_locs = rlc_df.groupby(['camera_id', 'intersection']).max().reset_index()
cam_locs.head()

# we find there is a mismatch between lens, one of them is duplicated
len(cam_locs)  # 364 total
len(cam_locs['camera_id'].unique()) # 363

cam_locs[cam_locs['camera_id'].duplicated()]  # 1421 is dupe
print('Two of them\n', cam_locs[cam_locs['camera_id'] == '1421'])  # we see two of them
print()

# Which one is it?
print('Damen/Diversey', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='DAMEN AND DIVERSEY')]['camera_id'].count())
print('Laramie/Fullerton:', rlc_df[(rlc_df['camera_id']=='1421') & (rlc_df['intersection']=='LARAMIE AND FULLERTON')]['camera_id'].count())

# Turns out that a camera has two locations. One was only used one time.  We drop it.
cam_locs = cam_locs[(cam_locs['camera_id']!='1421') | (cam_locs['intersection']!='DAMEN AND DIVERSEY')]
print("Total cams", len(cam_locs))  # 363 total (got rid of the bad one)

Two of them
    camera_id           intersection                  address violation_date  \
83      1421     DAMEN AND DIVERSEY  2000 W DIVERSEY PARKWAY     2017-11-30   
84      1421  LARAMIE AND FULLERTON    2400 N LARAMIE AVENUE     2020-12-19   

    violations   latitude  longitude  month  weekday  year  
83           1  41.932394 -87.678173     11        3  2017  
84           6  41.924152 -87.756295     12        6  2020  

Damen/Diversey 1
Laramie/Fullerton: 1210
Total cams 363


In [45]:
cam_locs.isna().sum()  # missing location for 19 cameras.  Let's fix it

camera_id          0
intersection       0
address            0
violation_date     0
violations         0
latitude          19
longitude         19
month              0
weekday            0
year               0
dtype: int64

Looks like we also are missing 19 of the 363 cam locations.  Let's look it up!

In [46]:

'''
This section goes through all of the rlc and assigns latlong
Many lights are missing it.  
For each light, there is an address though.
We use geocoding to get the latlong
'''

# let's get all of the red light cameras with their gps location.  
# This will aid in placing the accidents at rlc intersections later (if closer than threshold point to point)
# Some RLCs are missing location data,  but have addresses.  I can use geocoding I guess to look them up.


geolocator = Nominatim(user_agent="https://github.com/sciencelee/chicago_rlc")  # please change to match repo

# Some example code
#location = geolocator.geocode("175 5th Avenue NYC")
#print(location.address)
# out: Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...

#print((location.latitude, location.longitude))
# out: (40.7410861, -73.9896297241625)

#print(location.raw)
# out: {'place_id': '9167009604', 'type': 'attraction', ...}


# CAN USE THIS TO FIGURE OUT MY LAT LONG FROM RLC ADDRESS (or crash later)   
 
def get_geocode(lat, long, address):
    if lat > 0:  # it's a location
        return (lat, long)
    else: # it's a proper location tuple, and assumed to be correct latlong
        if address in address_fix.keys(): address = address_fix[address]  # errors in the dataset chars omitted
        # if we make it this far, we have no record of this cam_id yet, and it doesn't have a proper location
        location = geolocator.geocode(address + ', Chicago, IL')
        if location == None:
            print(address+':'+address+' : could not geolocate') # print it out if we can't find (address errors)
        else:
            return (location.latitude, location.longitude)

        


# Got it down to one-liner for this. Found out you can't extract and assign series like you can variables
cam_locs['location'] = cam_locs.apply(lambda x: get_geocode(x.latitude, x.longitude, x.address), axis=1)

In [None]:
cam_locs['location'].head()
cam_locs['latitude'] = cam_locs['location'].apply(lambda x: x[0])
cam_locs['longitude'] = cam_locs['location'].apply(lambda x: x[1])

cam_locs = cam_locs.drop(columns=['violation_date', 'violations', 'month', 'weekday', 'year', 'location'])

In [57]:
cam_locs.info()
cam_locs.isna().sum() # No longer missing location for 19 cameras.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 363 entries, 0 to 363
Data columns (total 5 columns):
camera_id       363 non-null object
intersection    363 non-null object
address         363 non-null object
latitude        363 non-null float64
longitude       363 non-null float64
dtypes: float64(2), object(3)
memory usage: 17.0+ KB


camera_id       0
intersection    0
address         0
latitude        0
longitude       0
dtype: int64

## Add a df of intersections with lat long
This should help us later determine if crash is at intersection

I chose to groupby the intersection and aggregate the most commonly occuring lat/long value

In [58]:
#results_df.groupby(['intersection', 'latitude', 'longitude']).reset_index()
intersection_df = rlc_df.groupby(['intersection']).agg({'latitude':pd.Series.mode,'longitude':pd.Series.mode,}).reset_index()

In [64]:
intersection_df.info()
intersection_df.head()  # that was easy!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183 entries, 0 to 182
Data columns (total 3 columns):
intersection    183 non-null object
latitude        183 non-null object
longitude       183 non-null object
dtypes: object(3)
memory usage: 4.4+ KB


Unnamed: 0,intersection,latitude,longitude
0,111TH AND HALSTED,41.6923,-87.6425
1,115TH AND HALSTED,41.6852,-87.6423
2,119TH AND HALSTED,41.6777,-87.6421
3,31ST AND CALIFORNIA,41.8373,-87.6952
4,31ST ST AND MARTIN LUTHER KING DRIVE,41.8385,-87.617


### We still have missing lat/long info for our rlc_df.  Let's fix it

In [66]:
rlc_df.isna().sum()  

intersection          0
camera_id             0
address               0
violation_date        0
violations            0
latitude          29675
longitude         29675
month                 0
weekday               0
year                  0
dtype: int64

In [68]:
# THIS TAKES SOME TIME (8min on my macbook pro)
def read_loc(camloc_df, lat, long, cam_id):
    cam = camloc_df[camloc_df['camera_id']==cam_id]
    return (float(cam['latitude']), float(cam['longitude']))
        

read_loc(cam_locs, 45, 87, '1002')  # testing purposes
#results_df[:5].apply(lambda x: x.latitude, axis=1)  # tsting purpose

# create another
rlc_df['location'] = rlc_df.apply(lambda x: read_loc(cam_locs, x.latitude, x.longitude, x.camera_id), axis=1)

In [69]:
# then add in the new lat longs to the df
rlc_df['latitude'] = rlc_df['location'].apply(lambda x: x[0])
rlc_df['longitude'] = rlc_df['location'].apply(lambda x: x[1])

In [74]:
if 'location' in rlc_df.columns:
    rlc_df.drop(columns=['location'], inplace=True)

## Crash data preprocessing

In [77]:
# drop a few columns we don't need, including location (we have lat/long)
dropme = ['statements_taken_i', 'private_property_i', 'photos_taken_i', 'dooring_i', 'date_police_notified','location']

crash_df.drop(columns=dropme, inplace=True)

In [78]:
crash_df.isna().sum()

crash_record_id                       0
rd_no                               577
crash_date                            0
posted_speed_limit                    0
traffic_control_device                0
device_condition                      0
weather_condition                     0
lighting_condition                    0
first_crash_type                      0
trafficway_type                       0
alignment                             0
roadway_surface_cond                  0
road_defect                           0
report_type                       11247
crash_type                            0
hit_and_run_i                    330448
damage                                0
prim_contributory_cause               0
sec_contributory_cause                0
street_no                             0
street_direction                      3
street_name                           1
beat_of_occurrence                    5
num_units                             0
most_severe_injury                  943


We have 2.5k entries that have no location.  Let's drop them

In [84]:
crash_df.dropna(subset=['latitude',], inplace=True)  # get rid of na locations

## Let's look at what is in the data   

In [85]:
# What's in this data?
col_interest = ['traffic_control_device', 'device_condition', 'weather_condition',
       'lighting_condition', 'first_crash_type', 'trafficway_type',
       'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
       'crash_type', 'hit_and_run_i', 'damage', 'prim_contributory_cause',
       'sec_contributory_cause', 'street_no', 'street_direction',
       'street_name', 'beat_of_occurrence', 'num_units', 'most_severe_injury', 
        'injuries_fatal', 'injuries_incapacitating',
       'injuries_non_incapacitating', 'injuries_reported_not_evident',
       'injuries_no_indication', 'injuries_unknown', 'crash_hour',
       'crash_day_of_week', 'crash_month', 'latitude', 'longitude', 'lane_cnt',
       'intersection_related_i', 'crash_date_est_i',
       'work_zone_i', 'work_zone_type',
       'workers_present_i']

for col in col_interest:
    print(col, crash_df[col].unique())

traffic_control_device ['UNKNOWN' 'NO CONTROLS' 'TRAFFIC SIGNAL' 'STOP SIGN/FLASHER'
 'SCHOOL ZONE' 'PEDESTRIAN CROSSING SIGN' 'YIELD'
 'FLASHING CONTROL SIGNAL' 'POLICE/FLAGMAN' 'RR CROSSING SIGN'
 'RAILROAD CROSSING GATE' 'OTHER RAILROAD CROSSING' 'DELINEATORS'
 'NO PASSING' 'BICYCLE CROSSING SIGN']
device_condition ['UNKNOWN' 'NO CONTROLS' 'FUNCTIONING PROPERLY' 'OTHER'
 'FUNCTIONING IMPROPERLY' 'NOT FUNCTIONING' 'WORN REFLECTIVE MATERIAL'
 'MISSING']
weather_condition ['UNKNOWN' 'CLEAR' 'SNOW' 'RAIN' 'CLOUDY/OVERCAST' 'FOG/SMOKE/HAZE'
 'OTHER' 'FREEZING RAIN/DRIZZLE' 'SLEET/HAIL' 'SEVERE CROSS WIND GATE'
 'BLOWING SNOW' 'BLOWING SAND, SOIL, DIRT']
lighting_condition ['DARKNESS, LIGHTED ROAD' 'DAYLIGHT' 'DARKNESS' 'DUSK' 'DAWN' 'UNKNOWN']
first_crash_type ['PARKED MOTOR VEHICLE' 'REAR END' 'ANGLE' 'SIDESWIPE SAME DIRECTION'
 'TURNING' 'REAR TO FRONT' 'PEDESTRIAN' 'PEDALCYCLIST' 'FIXED OBJECT'
 'SIDESWIPE OPPOSITE DIRECTION' 'OTHER NONCOLLISION' 'REAR TO SIDE'
 'OTHER OBJECT' 'HEAD O

intersection_related_i [nan 'Y' 'N']
crash_date_est_i [nan 'Y' 'N']
work_zone_i [nan 'Y' 'N']
work_zone_type [nan 'UTILITY' 'CONSTRUCTION' 'MAINTENANCE' 'UNKNOWN']
workers_present_i [nan 'Y' 'N']


This helps us.  
We can filter 'traffic_control_device' == 'TRAFFIC SIGNAL'.  
We can filter 'intersection_related_i' == 'Y'

This will leave us with only crashes that occurred at/because of intersections, and with a signal at the intersection.

intersection_related_i: A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.

In [213]:
crash_df = crash_df[(crash_df['traffic_control_device']=='TRAFFIC SIGNAL') & \
                    (crash_df['intersection_related_i']=='Y')]

## We will now focus on trying to bring rlc intersections to our crashes
We find that we have 363 cameras at 183 intersections

In [214]:
cam_locs

Unnamed: 0,camera_id,intersection,address,latitude,longitude
0,1002,WESTERN AND CERMAK,2200 S WESTERN AVENUE,41.851984,-87.685786
1,1003,WESTERN AND CERMAK,2400 W CERMAK ROAD,41.852141,-87.685753
2,1011,PETERSON AND WESTERN,6000 N WESTERN AVE,41.990586,-87.689822
3,1014,PETERSON AND WESTERN,2400 W PETERSON,41.990609,-87.689735
4,1023,IRVING PARK AND NARRAGANSETT,6400 W IRVING PK,41.953025,-87.786683
...,...,...,...,...,...
359,3051,LAKE AND UPPER WACKER,200 N UPPER WACKER DR,41.886067,-87.636537
360,3052,LAKE AND UPPER WACKER,340 W UPPER WACKER DR,41.886997,-87.626608
361,3072,DAMEN AND ELSTON,2426 N DAMEN AVE,41.925732,-87.678089
362,3082,MICHIGAN AND ONTARIO,628 N MICHIGAN AVE,41.893432,-87.624364


In [194]:
int_cams = cam_locs.groupby(['intersection']) \
                    .agg({'latitude':pd.Series.max, 'longitude':pd.Series.max,}) \
                    .reset_index()

int_cams['cam1'] = int_cams['intersection'] \
                            .apply(lambda x: cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[0])

int_cams['cam2'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])==1 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[1])

int_cams['cam3'] = int_cams['intersection'].apply( \
                            lambda x: None if len(cam_locs[cam_locs['intersection']==x])<3 \
                            else cam_locs[cam_locs['intersection']==x]['camera_id'].iloc[2])                             

int_cams.head()



Unnamed: 0,intersection,latitude,longitude,cam1,cam2,cam3
0,111TH AND HALSTED,41.692465,-87.642441,2422,2424,
1,115TH AND HALSTED,41.68519,-87.64228,2552,2553,
2,119TH AND HALSTED,41.677923,-87.64199,2402,2404,
3,31ST AND CALIFORNIA,41.838438,-87.687713,2061,2064,
4,31ST ST AND MARTIN LUTHER KING DRIVE,41.838534,-87.61564,2121,2123,


In [197]:
print('Total Cameras', len(cam_locs))
print('Total Intersections', len(int_cams))

Total Cameras 363
Total Intersections 183


### Add the intersection to my crashes


In [215]:
# Now I am desperate.  This takes too long to process.  Let's simplify it and make it a box instead.
box_side = 50  # effectively makes it check for crash being within 25m of interscection
box_lat = box_side / 111070 / 2 # 111070 is meters in deg lat in Chicago
box_long = box_side / 83000 / 2 # 83000 is meters in deg long in Chicago

def box_check(lat, long, ints_df):
    answer = (ints_df[  (ints_df['latitude'] > (lat - box_lat)) & 
                      (ints_df['latitude'] < (lat + box_lat)) &
                      (ints_df['longitude'] > (long - box_long)) &
                      (ints_df['longitude'] < (long + box_long))
                     ])
    if answer.empty: return None
    return answer['intersection'].values[0]
    
# THIS SEEMS TO WORK WITH SPEED AND ELIMINATES MEMORY PROBLEM
for i in range(100): #(len(df)):
    intersect = box_check(float(crash_df.iloc[i]['latitude']), 
                          float(crash_df.iloc[i]['longitude']), 
                          int_cams)
    if intersect: print(intersect)
    
    
# MOMENT OF TRUTH (takes a few minutes, but at least it runs.  This caused me lots of trouble)
crash_df['intersection'] = crash_df.apply(lambda x: box_check(float(x.latitude), 
                                                              float(x.longitude), 
                                                              int_cams), axis=1)

WESTERN AND 79TH
BLUE ISLAND AND DAMEN
LAWRENCE AND WESTERN
HALSTED AND 95TH
WESTERN AND FULLERTON
WESTERN AND ARMITAGE
LAFAYETTE AND 87TH
ASHLAND AND LAWRENCE
WESTERN AND FULLERTON
HOLLYWOOD AND SHERIDAN
DIVERSEY AND WESTERN
PULASKI AND MONTROSE
BROADWAY/SHERIDAN AND DEVON
CICERO AND 47TH
PULASKI AND NORTH
ASHLAND AND 47TH
4700 WESTERN
PETERSON AND WESTERN
LAFAYETTE AND 87TH
JEFFERY AND 79TH
COTTAGE GROVE AND 95TH
CALIFORNIA AND DIVERSEY
JEFFERY AND 79TH
HALSTED AND 63RD
CICERO AND PETERSON
ARCHER/NARRAGANSETT AND 55TH
DIVERSEY AND AUSTIN
55TH AND KEDZIE
MADISON AND WESTERN
ASHLAND AND IRVING PARK
99TH AND HALSTED
PETERSON AND WESTERN
ASHLAND AND MADISON
HOLLYWOOD AND SHERIDAN
COTTAGE GROVE AND 95TH
KEDZIE AND 26TH
HALSTED AND MADISON
CICERO AND WASHINGTON
BROADWAY/SHERIDAN AND DEVON
99TH AND HALSTED
PULASKI AND MONTROSE
WESTERN AND NORTH
WESTERN AND ADDISON
KEDZIE AND 31ST
CALIFORNIA AND DIVERSEY
DAMEN AND 63RD
HALSTED AND 95TH
FULLERTON AND NARRAGANSETT
HALSTED AND 103RD
SACRAMENTO 

In [216]:
crash_df['crash_date'] = pd.to_datetime(crash_df['crash_date'])
crash_df['year'] = crash_df['crash_date'].apply(lambda x: int(x.year))
crash_df['month'] = crash_df['crash_date'].apply(lambda x: int(x.month))
crash_df['day'] = crash_df['crash_date'].apply(lambda x: int(x.day))
crash_df['hour'] = crash_df['crash_date'].apply(lambda x: int(x.hour))

In [217]:
crash_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60332 entries, 3 to 466328
Data columns (total 48 columns):
crash_record_id                  60332 non-null object
rd_no                            60260 non-null object
crash_date                       60332 non-null datetime64[ns]
posted_speed_limit               60332 non-null object
traffic_control_device           60332 non-null object
device_condition                 60332 non-null object
weather_condition                60332 non-null object
lighting_condition               60332 non-null object
first_crash_type                 60332 non-null object
trafficway_type                  60332 non-null object
alignment                        60332 non-null object
roadway_surface_cond             60332 non-null object
road_defect                      60332 non-null object
report_type                      58554 non-null object
crash_type                       60332 non-null object
hit_and_run_i                    11420 non-null object
da

## Now add congestion region number to my crash data


In [None]:
traffic_df.head()

In [None]:
# need to get rid of congestion if there aren't enough buses to properly measure it.
# choosing to just fill it in with a .9 quantile

speed90 = traffic_df['speed'].quantile(0.9)
print(speed90)
#traffic_df['speed'].apply()

# Now we dump all of the data into a single db with multiple tables

## Create/connect to db and build the TABLEs

In [54]:
# Create a db
conn = create_connection('database/rlc.db')  # function I created in myfuncs file
c = conn.cursor()
#conn.close()

sqlite3 version: 2.6.0
connected to database/rlc.db


### daily_violations TABLE - from rlc_df

In [73]:
sql_fetch_tables(c, conn)  # helper function in myfuncs
delete_all_entries(c, conn, 'daily_violations') # in myfuncs
rlc_df.to_sql('daily_violations', conn, if_exists='replace', index = False)

[('cam_startend',), ('intersection_locations',), ('hourly_congestion',), ('congestion_regions',), ('hourly_weather',), ('intersection_cams',), ('signal_crashes',), ('cam_locations',), ('inersection_locations',), ('daily_violations',)]


### cam_locations TABLE - from cam_locs

In [62]:
sql_fetch_tables(c, conn)  # helper function in myfuncs
delete_all_entries(c, conn, 'cam_locations') # in myfuncs
cam_locs.to_sql('cam_locations', conn, if_exists='replace', index = False)

[('cam_startend',), ('intersection_locations',), ('hourly_congestion',), ('congestion_regions',), ('hourly_weather',), ('intersection_cams',), ('signal_crashes',), ('cam_locations',), ('daily_violations',)]


### intersection_locations TABLE - from intersection_df

In [63]:
sql_fetch_tables(c, conn)  # helper function in myfuncs
delete_all_entries(c, conn, 'intersection_locations') # in myfuncs
intersection_df.to_sql('inersection_locations', conn, if_exists='replace', index = False)

[('cam_startend',), ('intersection_locations',), ('hourly_congestion',), ('congestion_regions',), ('hourly_weather',), ('intersection_cams',), ('signal_crashes',), ('daily_violations',), ('cam_locations',)]


### intersection_cams TABLE - from int_cams

In [195]:
sql_fetch_tables(c, conn)  # helper function in myfuncs
delete_all_entries(c, conn, 'intersection_cams') # in myfuncs
int_cams.to_sql('inersection_cams', conn, if_exists='replace', index = False)

[('cam_startend',), ('intersection_locations',), ('hourly_congestion',), ('congestion_regions',), ('hourly_weather',), ('intersection_cams',), ('signal_crashes',), ('cam_locations',), ('inersection_locations',), ('daily_violations',)]


### signal_crashes TABLE - from crash_df

In [198]:
sql_fetch_tables(c, conn)  # helper function in myfuncs
delete_all_entries(c, conn, 'signal_crashes') # in myfuncs
crash_df.to_sql('signal_crashes', conn, if_exists='replace', index = False)

[('cam_startend',), ('intersection_locations',), ('hourly_congestion',), ('congestion_regions',), ('hourly_weather',), ('intersection_cams',), ('signal_crashes',), ('cam_locations',), ('inersection_locations',), ('daily_violations',), ('inersection_cams',)]


### Use this code to test any of your tables for proper data storage

In [29]:
query = c.execute("SELECT camera_id, violations FROM daily_violations;").fetchall()
print(query[:5])
print(len(query))

[('2763', 4), ('2552', 5), ('2764', 4), ('1503', 3), ('2141', 3)]
569593
