# Processing

## Author: Tilova Shahrin

Table of Contents:

- [Coordinates API using Geopy](#geoapi)

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from scipy import stats
import seaborn as sns
import os
import glob
import datetime
from geopy.geocoders import Nominatim
import time

Recall we downloaded a new csv file from our cleaning in the last file `Cleaning and EDA`. Let's upload the file into a dataframe.

In [4]:
parking_df = pd.read_csv('../data/parking_df.csv')

In [17]:
parking_df.head()

Unnamed: 0,date_of_infraction,infraction_code,infraction_description,set_fine_amount,time_of_infraction,location1,location2,location3,location4,province,datetime_of_infraction
0,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,1546 BLOOR ST W,,,ON,2016-12-30 16:37:00
1,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,5418 YONGE ST,,,ON,2016-12-30 16:37:00
2,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,OPP,777 QUEEN ST W,,,ON,2016-12-30 16:37:00
3,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,747 QUEEN ST E,,,ON,2016-12-30 16:37:00
4,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,3042 DUNDAS ST W,,,ON,2016-12-30 16:37:00


In [18]:
parking_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12960604 entries, 0 to 12960603
Data columns (total 11 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   date_of_infraction      object 
 1   infraction_code         float64
 2   infraction_description  object 
 3   set_fine_amount         int64  
 4   time_of_infraction      object 
 5   location1               object 
 6   location2               object 
 7   location3               object 
 8   location4               object 
 9   province                object 
 10  datetime_of_infraction  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 1.1+ GB


In [40]:
parking_df.duplicated().sum()

0

In [36]:
parking_df['datetime_of_infraction'].astype('datetime64[ns]')

0          2016-12-30 16:37:00
1          2016-12-30 16:37:00
2          2016-12-30 16:37:00
3          2016-12-30 16:37:00
4          2016-12-30 16:37:00
                   ...        
12960599   2022-12-12 09:46:00
12960600   2022-12-12 09:46:00
12960601   2022-12-12 09:47:00
12960602   2022-12-12 09:47:00
12960603   2022-12-12 09:47:00
Name: datetime_of_infraction, Length: 12960604, dtype: datetime64[ns]

<a id='geoapi'></a>
## Geocoders  API

I want to find a way to produce geospatial analysis for my machine learning models. With that, I need geo coordinates `latitude` and `longitude`. 

We're going to use this code snippet to change address to a coordinate. However, we need to run this geolocator a lot of times.  

In [20]:
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="toronto-parking-application")
location = geolocator.geocode("Edward St Toronto ON Canada")
print(location.address)
print((location.latitude, location.longitude))

Edward Street, Discovery District, University—Rosedale, Old Toronto, Toronto, Golden Horseshoe, Ontario, M5B 1R7, Canada
(43.6566007, -79.3831637)


Recall the shape. That is a lot of rows to process, about 13 million, even after cleaning this is going to take a while. We need to find a way to reduce the amount of API calls!

In [21]:
parking_df.shape

(12960604, 11)

Let's get the unique values of the addresses, and apply those addresses onto any duplicate addresses. Recall the addresses with the most tickets. Some have about 30 thousand tickets. To reduce, we get the unique value, add to dictionary and find a way to use that same address after getting the coordinate. 

In [39]:
parking_df['location2'].nunique()

482566

About 400k unique values. That is a lot less than the number of rows. We can use these addresses and process them onto the API. 

In [22]:
parking_df['location2'].unique()

array(['1546 BLOOR ST W', '5418 YONGE ST', '777 QUEEN ST W', ...,
       '28 LAMBERTON BLVD', '30151 GLENDALE AVE', '576 FORMAN AVE'],
      dtype=object)

In [23]:
def geocode_address(address):
    geolocator = Nominatim(user_agent="toronto-parking-application")
    location = geolocator.geocode(address, timeout=10)
    if location:
        return location.latitude, location.longitude
    else:
        return None, None

Let's apply the code snippet from before to get the coordinates of these unique values. 

In [None]:
import json
from IPython.display import clear_output

#opens existing json file
f = open('address_data.json')
data = json.load(f)
f.close()

#initialize count
count = 0

#get all the unique addresses in lower case
unique_addresses = parking_df['location2'].str.lower().unique()


try:
    for address in unique_addresses:
        #add city, province and country to address to get an accurate coordinate
        address += ', toronto, on, canada'
        
        #if address isn't in dictionary
        if address not in data.keys():
            clear_output(wait=True)
            print(round(len(data.keys())/parking_df['location2'].nunique()*100, 4), flush=True)
            
            #add count
            count += 1
            
            #get coordinates from api geocode_address
            coord = geocode_address(address)
            
            #add to dictionary
            data[address] = coord
            
            if count == 100:
                #once we save 100 coordinates, write onto json file
                with open('address_data.json', 'w') as json_file:
                    json.dump(data, json_file)

                # Reset the counter
                counter = 0
                
#if I need to stop, dump coordinates from the dict into json              
except KeyboardInterrupt:
    print("Saving the collected data to a JSON file.")
    with open('address_data.json', 'w') as json_file:
        json.dump(data, json_file)
    raise

In [1]:
import json
f = open('address_data.json')
data = json.load(f)
len(data)

114736

Now let's apply these coordinates as new columns in the dataframe. 

In [21]:
parking_df['latitude'] = None
parking_df['longitude'] = None
parking_df.head()

Unnamed: 0,date_of_infraction,infraction_code,infraction_description,set_fine_amount,time_of_infraction,location1,location2,location3,location4,province,datetime_of_infraction,latitude,longitude
0,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,1546 BLOOR ST W,,,ON,2016-12-30 16:37:00,,
1,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,5418 YONGE ST,,,ON,2016-12-30 16:37:00,,
2,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,OPP,777 QUEEN ST W,,,ON,2016-12-30 16:37:00,,
3,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,747 QUEEN ST E,,,ON,2016-12-30 16:37:00,,
4,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,3042 DUNDAS ST W,,,ON,2016-12-30 16:37:00,,


In [None]:
import json

with open('address_data.json', 'r') as json_file:
    data = json.load(json_file)

# Iterate through unique addresses in parking_df['location2']
for address in parking_df['location2'].str.lower().unique():
    address += ', toronto, on, canada'
    if address in data:    
        latitude, longitude = data[address]
        row_index = (parking_df['location2'].str.lower() == address)
        
        parking_df[parking_df.loc[row_index, 'latitude']] = latitude
        parking_df[parking_df.loc[row_index, 'longitude']] = longitude
        

In [30]:
parking_df.head()

Unnamed: 0,date_of_infraction,infraction_code,infraction_description,set_fine_amount,time_of_infraction,location1,location2,location3,location4,province,datetime_of_infraction,latitude,longitude
0,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,1546 BLOOR ST W,,,ON,2016-12-30 16:37:00,,
1,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,5418 YONGE ST,,,ON,2016-12-30 16:37:00,,
2,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,OPP,777 QUEEN ST W,,,ON,2016-12-30 16:37:00,,
3,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,747 QUEEN ST E,,,ON,2016-12-30 16:37:00,,
4,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,3042 DUNDAS ST W,,,ON,2016-12-30 16:37:00,,


### Clustering Parking Infractions

In [32]:
parking_df['infraction_description'].value_counts().head(10)

infraction_description
PARK ON PRIVATE PROPERTY          2618813
PARK-SIGNED HWY-PROHIBIT DY/TM    2114718
PARK PROHIBITED TIME NO PERMIT    1861152
PARK MACHINE-REQD FEE NOT PAID    1582572
STOP-SIGNED HWY-PROHIBIT TM/DY     694447
PARK - LONGER THAN 3 HOURS         613687
STAND VEH.-PROHIBIT TIME/DAY       467929
PARK-VEH. W/O VALID ONT PLATE      363369
PARK-SIGNED HWY-EXC PERMT TIME     359185
STOP-SIGNED HIGHWAY-RUSH HOUR      344968
Name: count, dtype: int64

In [28]:
infraction_unique = parking_df['infraction_description'].unique().astype(object)

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

#Tokenizing these reviews
parking_df["infraction_description"].str.split(' ')


# instantiate
bagofwords = CountVectorizer(stop_words="english")

# fit
bagofwords.fit(parking_df["infraction_description"])

# transform
parking_df_transformed = bagofwords.transform(parking_df["infraction_description"])
parking_df_transformed

<13353939x321 sparse matrix of type '<class 'numpy.int64'>'
	with 59262194 stored elements in Compressed Sparse Row format>

In [31]:
bagofwords.get_feature_names_out()

array(['0m', '15m', '18', '1h', '20', '24hr', '2am', '30', '30cm', '3h',
       '3m', '45', '45deg', '5m', '60', '60deg', '6am', '6m', '7days',
       '9m', 'acc', 'accessible', 'activ', 'activate', 'adjacent',
       'alongside', 'angle', 'approach', 'area', 'avenue', 'basin',
       'bcycl', 'bef', 'bicycle', 'bld', 'bldg', 'block', 'boul',
       'boulevard', 'boulevd', 'bridge', 'bst', 'bstp', 'bus', 'bycl',
       'car', 'card', 'centre', 'chg', 'clse', 'cm', 'cntre', 'commerc',
       'commercial', 'condition', 'conn', 'connect', 'connected', 'cons',
       'consen', 'consent', 'cont', 'contr', 'contrary', 'crb', 'cross',
       'crosswalk', 'crsswlk', 'curb', 'cycle', 'day', 'days', 'dead',
       'dec1', 'deg', 'dep', 'deposit', 'desig', 'designate',
       'designated', 'display', 'displayed', 'door', 'driveway', 'drng',
       'drop', 'drp', 'drway', 'dy', 'elec', 'elect', 'electric',
       'elevated', 'encl', 'end', 'enter', 'entry', 'ev', 'exc', 'excav',
       'exceeds', 