# Retrieve Locations

The purpose of this notebook is to retrieve the locations of the incomplete records. To do so, it uses the Google geo-API, which has a limit of 2500 queries per day. The algorithm below must therefore be able to run multiple times on the CSV file containing the locations in order to complete the retrieval.
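
The resume logic described above can be sketched in isolation. This is a minimal illustration, not the notebook's own code: the helper name `next_batch`, the temporary file, and the sample rows are assumptions made for the demo. It counts the lines already written to the output file and derives the window of rows that the next run should process.

```python
import os
import tempfile


def next_batch(output_path, n_total_cases, batch_size):
    """Return (start_index, end_index) for the next run (hypothetical helper)."""
    # Each newline in the output file marks one record already processed
    with open(output_path, "r") as handle:
        n_done = handle.read().count("\n")
    start_index = n_done
    # Never run past the last record of the input
    end_index = min(start_index + batch_size, n_total_cases)
    return start_index, end_index


# Demo on a throwaway file holding three made-up result rows
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "retrieved.csv")
    with open(demo_path, "w") as handle:
        handle.write("0,SOME ADDRESS,0.0,0.0,00000\n" * 3)
    demo_window = next_batch(demo_path, n_total_cases=10, batch_size=5)
    last_window = next_batch(demo_path, n_total_cases=4, batch_size=5)

print(demo_window)  # (3, 8)
print(last_window)  # (3, 4)
```

Because the start index is read back from the output file, an interrupted or quota-limited run can simply be restarted and will continue where the previous one stopped.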

In [1]:
# The Geocoder module is used.
from pygeocoder import Geocoder
import time
import pandas as pd
import os

In [2]:
# Files
input_file = "./Intermediate Data/Missing_Zip_Lat_Long.csv"
output_file = "./Intermediate Data/Retrieved_Zip_Lat_Long.csv"

In [3]:
# Check if file exists, if not, create it
if not os.path.exists(output_file):
    open(output_file, 'w').close()

In [4]:
def CorrectLongitudeLatitude(n,input_file,output_file):
    '''
    Goal: retrieve the zipcode, longitude, and latitude of a given pothole.
    Input: n, the maximum number of records to process in this run;
           input_file, a csv file containing the missing locations.
    Output: output_file, a csv file to which the retrieved results are appended.
    '''
    print("Retrieving missing locations using Google geo-API...")
    
    # Load the dataframe of missing locations
    missing_location_df = pd.read_csv(input_file)

    # Inspect the output file (read-only) to determine how much work is left:
    # each line already written corresponds to one processed record
    with open(output_file, 'r') as output:
        n_case_retrieved = output.read().count("\n")
    n_total_cases = missing_location_df.shape[0]

    # Reopen the output file in append mode so earlier results are kept
    output = open(output_file, 'a')
          
    print("Total cases: "+str(n_total_cases))
    print("Work already performed on: "+str(n_case_retrieved))
    
    # Set the start and end indexes    
    start_index = n_case_retrieved
    end_index = min(start_index+n,n_total_cases)
    
    print("Fetch starts on " + str(min(n,n_total_cases-start_index)) + " items...")
    
    progress=0
    for index in range(start_index,end_index):
        
        current_index = missing_location_df.iloc[index,0]
        current_address = missing_location_df.iloc[index,1].replace("  "," ").strip()
        
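        # If address starts with a 0, remove it together with the character that follows it
        # (startswith also avoids an IndexError on an empty address)
        if current_address.startswith("0"):
            current_address = current_address[2:]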
        
        # If address contains two street numbers, keep the second one
        if "-" in current_address:
            current_address = current_address.split("-")[1]
            
        # If address is an intersection, remove "INTERSECTION of" and replace "&" with "and"
        if "INTERSECTION of" in current_address:
            current_address = current_address.replace("INTERSECTION of ","")
            current_address = current_address.replace('&','and')
                
        time.sleep(0.2)
        
        # Reset location so a failed lookup cannot reuse the previous record's result
        location = None
        try:
            location = Geocoder.geocode(current_address)
        except Exception:
            pass
        try:
            latitude,longitude = location.coordinates
        except Exception:
            latitude,longitude = "ERROR","ERROR"
        try:
            zipcode = location.postal_code
        except Exception:
            zipcode = "ERROR"
        
        output_location = str(current_index)+","+str(current_address)+","+str(latitude)+","+str(longitude)+","+str(zipcode)+"\n"

        progress+=1
        output.write(output_location)
        if progress%50==0:
            print("Done with "+str(progress)+" records...")
            
    output.close()
    print("Search completed...")
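
The address clean-up performed inside the loop can be pulled out into a small pure function, which makes the three rules easier to check without calling the API. The sketch below mirrors those rules as written in the function above; the name `clean_address` and the sample addresses are hypothetical and do not come from the dataset.

```python
def clean_address(raw_address):
    """Apply the notebook's clean-up rules to one address (hypothetical helper)."""
    # Collapse double spaces and trim surrounding whitespace
    address = raw_address.replace("  ", " ").strip()

    # Leading "0": drop it together with the character that follows
    if address.startswith("0"):
        address = address[2:]

    # Two street numbers such as "12-14": keep what follows the first hyphen
    if "-" in address:
        address = address.split("-")[1]

    # Intersections: drop the prefix and spell out the ampersand
    if "INTERSECTION of" in address:
        address = address.replace("INTERSECTION of ", "")
        address = address.replace("&", "and")

    return address


# Made-up sample addresses, one per rule
print(clean_address("  0 MAIN  ST "))                # MAIN ST
print(clean_address("12-14 ELM ST"))                 # 14 ELM ST
print(clean_address("INTERSECTION of A ST & B ST"))  # A ST and B ST
```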

In [5]:
CorrectLongitudeLatitude(500,input_file,output_file)

Retrieving missing locations using Google geo-API...
Total cases: 5812
Work already performed on: 5479
Fetch starts on 333 items...
Done with 50 records...
Done with 100 records...
Done with 150 records...
Done with 200 records...
Done with 250 records...
Done with 300 records...
Search completed...


The above function is run several times (in batches of 500 records) in order to avoid triggering the API limitations.
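
As a rough planning aid, the number of runs and the minimum number of days can be estimated from the figures already given in this notebook (5812 cases, batches of 500, 2500 queries per day). The helper names below are assumptions for illustration, and the estimate assumes one query per record with no retries.

```python
import math


def runs_needed(n_remaining, batch_size):
    """Number of calls to the retrieval function needed to cover n_remaining records."""
    return math.ceil(n_remaining / batch_size)


def days_needed(n_remaining, daily_quota):
    """Minimum number of days if every record costs exactly one query."""
    return math.ceil(n_remaining / daily_quota)


total_runs = runs_needed(5812, 500)
total_days = days_needed(5812, 2500)

print(total_runs)  # 12
print(total_days)  # 3
```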