# Processing

## Author: Tilova Shahrin

Table of Contents:

- [Coordinates API using Geopy](#geoapi)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from scipy import stats
import seaborn as sns
import os
import glob
import datetime

Recall we downloaded a new csv file from our cleaning in the last file `Cleaning and EDA`. Let's upload the file into a dataframe.

In [2]:
parking_df = pd.read_csv('../data/parking_df.csv')

In [4]:
parking_df.head()

Unnamed: 0,date_of_infraction,infraction_code,infraction_description,set_fine_amount,time_of_infraction,location1,location2,location3,location4,province,datetime_of_infraction
0,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,1546 BLOOR ST W,,,ON,2016-12-30 16:37:00
1,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,5418 YONGE ST,,,ON,2016-12-30 16:37:00
2,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,OPP,777 QUEEN ST W,,,ON,2016-12-30 16:37:00
3,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,747 QUEEN ST E,,,ON,2016-12-30 16:37:00
4,2016-12-30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,16:37:00,N/S,3042 DUNDAS ST W,,,ON,2016-12-30 16:37:00


In [5]:
parking_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13353939 entries, 0 to 13353938
Data columns (total 11 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   date_of_infraction      object 
 1   infraction_code         float64
 2   infraction_description  object 
 3   set_fine_amount         int64  
 4   time_of_infraction      object 
 5   location1               object 
 6   location2               object 
 7   location3               object 
 8   location4               object 
 9   province                object 
 10  datetime_of_infraction  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 1.1+ GB


In [7]:
parking_df.isna().sum()

date_of_infraction               0
infraction_code                  2
infraction_description           0
set_fine_amount                  0
time_of_infraction               0
location1                        0
location2                        0
location3                 12376426
location4                 12374286
province                         3
datetime_of_infraction           0
dtype: int64

<a id='geoapi'></a>
## Geocoders  API

I want to find a way to produce geospatial analysis for my machine learning models. With that, I need geo coordinates `latitude` and `longitude`. 

We're going to use this code snippet to change address to a coordinate. However, we need to run this geolocator a lot of times.  

In [8]:
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="toronto-parking-application")
location = geolocator.geocode("Edward St Toronto ON Canada")
print(location.address)
print((location.latitude, location.longitude))

Edward Street, Discovery District, University—Rosedale, Old Toronto, Toronto, Golden Horseshoe, Ontario, M5B 1R7, Canada
(43.6566007, -79.3831637)


Recall the shape. That is a lot of rows to process, about 13 million, even after cleaning this is going to take a while. We need to find a way to reduce the amount of API calls!

In [10]:
parking_df.shape

(13353939, 11)

Let's get the unique values of the addresses, and apply those addresses onto any duplicate addresses. Recall the addresses with the most tickets. Some have about 30 thousand tickets. To reduce, we get the unique value, add to dictionary and find a way to use that same address after getting the coordinate. 

In [13]:
parking_df['location2'].nunique()

482566

About 400k unique values. That is a lot less than the number of rows. We can use these addresses and process them onto the API. 

In [15]:
parking_df['location2'].unique()

array(['1546 BLOOR ST W', '5418 YONGE ST', '777 QUEEN ST W', ...,
       '28 LAMBERTON BLVD', '30151 GLENDALE AVE', '576 FORMAN AVE'],
      dtype=object)

Let's apply the code snippet from before to get the coordinates of these unique values. 

In [None]:
from geopy.geocoders import Nominatim
import time
import json
from functools import lru_cache

@lru_cache(maxsize=None)
# Function to geocode address
def geocode_address(address):
    geolocator = Nominatim(user_agent="toronto-parking-application")
    location = geolocator.geocode(address, timeout=10)
    if location:
        return location.latitude, location.longitude
    else:
        return None, None
    
# get the unique values of locations
unique_addresses = parking_df['location2'].str.lower().unique()

# create dict, check if json file already exists
try:
    with open('address_data.json', 'r') as json_file:
        print(f'found file {json_file}')
        address_dict = json.load(json_file)
except FileNotFoundError:
    print('file DNE')
    # if the file does not exist, initialize an empty dictionary
    address_dict = {}

#set counter 
counter = 0 

try:
    # loop through addresses
    for address in unique_addresses:
        address += ', toronto, on, canada'
        # check if address is not already in the dictionary
        if address not in address_dict.keys():
            try:
                # add the address to the dictionary 
                address_dict[address] = geocode_address(address)
                counter += 1

                # keep a counter, if counter at 100, save dict to json and counter to 0, otherwise increase counter
                if counter == 100:
                    print('Processed 100 rows')
                    # Save the dictionary to a JSON file
                    with open('address_data.json', 'a') as json_file:
                        json.dump(address_dict, json_file)

                    # Reset the counter
                    counter = 0

                    #sleep for second 
                    time.sleep(1)

            except Exception as e:
                print(f"An error occurred: {e}")
                print("Saving the collected data to a JSON file.")
                with open('address_data.json', 'a') as json_file:
                    json.dump(address_dict, json_file)
                raise
                
except KeyboardInterrupt:
    print("Keyboard Interrupt.")
    print("Saving the collected data to a JSON file.")
    with open('address_data.json', 'a') as json_file:
        json.dump(address_dict, json_file)
    raise

In [13]:
import json
f = open('address_data.json')
 
# returns JSON object as 
# a dictionary
data = json.load(f)
 
# Iterating through the json
# list
for i in data['emp_details']:
    print(i)
 
f.close()

found file <_io.TextIOWrapper name='address_data.json' mode='r' encoding='UTF-8'>


JSONDecodeError: Extra data: line 1 column 1069314 (char 1069313)