# Geolocation

## Objective

In the 2019 version of the Aqueduct Water Risk Atlas, users should be able to 
upload Excel files with addresses. A service will geolocate these addresses and 
return coordinates for each one. 

Criteria:
1. Secure
1. Fast (target still to be specified)
1. Handle up to 1,000 locations at a time (geocoding 1,000 addresses with Google costs about 5 USD)
1. Proper error handling
1. Include a sense of the quality of the match

- For now, just use a template with street name, city, country, etc. 
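
The template and the 1,000-location limit suggest a validation step before any geocoding request is sent. The following is a minimal sketch under stated assumptions: the column names `street`, `city`, and `country` and the helper `validate_upload` are hypothetical, and rows are taken to be plain dictionaries (for example from `DataFrame.to_dict("records")` with missing cells already converted to `None`).

```python
REQUIRED_COLUMNS = ("street", "city", "country")  # hypothetical template columns
MAX_LOCATIONS = 1000  # upper bound taken from the criteria above


def validate_upload(rows):
    """Return a list of problems; an empty list means the upload looks usable."""
    problems = []
    if len(rows) > MAX_LOCATIONS:
        problems.append(f"too many locations: {len(rows)} > {MAX_LOCATIONS}")
    for i, row in enumerate(rows):
        # A field counts as missing when it is absent, None, or only whitespace
        missing = [c for c in REQUIRED_COLUMNS if not str(row.get(c) or "").strip()]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
    return problems
```

Returning all problems at once, rather than raising on the first, lets the service report every bad row to the user in a single response.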


## Google
### API default limits

Requests per day: Unlimited  
Requests per 100 seconds: 5,000 (50/s, excluding bursts)  
Requests per 100 seconds per user: Unlimited  

You can change these settings and we will have to set them appropriately. 
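
To turn the default quota into a delay between consecutive requests, divide the window by the number of requests allowed in it. This is a small sketch; the function name and the `safety_factor` parameter are assumptions for illustration, and the result only holds for a single sequential client (parallel workers share the same quota).

```python
def min_delay_seconds(requests_per_window, window_seconds=100.0, safety_factor=0.9):
    """Smallest delay between requests that stays within the quota.

    safety_factor < 1 deliberately uses only part of the quota as a buffer.
    """
    usable_requests = requests_per_window * safety_factor
    return window_seconds / usable_requests


# Default quota: 5,000 requests per 100 seconds
delay = min_delay_seconds(5000)
print(f"wait at least {delay:.4f} s between requests")
```

The resulting value could be passed as `min_delay_seconds` to geopy's `RateLimiter`, which is used later in this notebook.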

Google API prices:
https://developers.google.com/maps/documentation/geocoding/usage-and-billing

| Requests | Price per request (USD) | Price per 1,000 requests (USD) |
|---|---|---|
| 0–100,000 | 0.005 | 5.00 |
| 100,001–500,000 | 0.004 | 4.00 |
| 500,000+ | contact sales | contact sales |
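
The price tiers above can be turned into a rough cost estimate. This sketch assumes the tiers are marginal (the first 100,000 requests are billed at the first rate and only the requests beyond that at the second), which should be checked against the billing page; `estimate_cost_usd` is a hypothetical helper.

```python
# (upper bound of tier, price per request in USD), taken from the list above
TIERS = [
    (100_000, 0.005),
    (500_000, 0.004),
]


def estimate_cost_usd(n_requests):
    """Estimate the geocoding cost, assuming marginal pricing per tier."""
    if n_requests < 0:
        raise ValueError("n_requests must be non-negative")
    if n_requests > TIERS[-1][0]:
        raise ValueError("above 500,000 requests the price is set by contacting sales")
    cost = 0.0
    lower = 0
    for upper, price in TIERS:
        # Number of requests that fall inside this tier
        in_tier = max(0, min(n_requests, upper) - lower)
        cost += in_tier * price
        lower = upper
    return cost


print(estimate_cost_usd(1000))  # one full upload of 1,000 addresses
```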

### How should I format my geocoder queries to maximise the number of successful requests?
The geocoder is designed to map street addresses to geographical coordinates. We therefore recommend that you format geocoder requests in accordance with the following guidelines to maximize the likelihood of a successful query:

- Specify addresses in accordance with the format used by the national postal service of the country concerned.
- Do not specify additional address elements such as business names, unit numbers, floor numbers, or suite numbers that are not included in the address as defined by the postal service of the country concerned. Doing so may result in responses with ZERO_RESULTS.
- Use the street number of a premise in preference to the building name where possible.
- Use street number addressing in preference to specifying cross streets where possible.
- Do not provide 'hints' such as nearby landmarks.
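
One way to follow these guidelines is to build the query string only from the postal fields of the upload template, so that business names or landmarks never reach the geocoder. The helper below is a sketch: `build_address` and its parameter names are hypothetical, and the fixed field order is a simplification, since the correct order depends on the national postal format.

```python
def build_address(street, city=None, postal_code=None, country=None):
    """Join the non-empty postal fields into a single geocoder query string."""
    parts = [street, city, postal_code, country]
    cleaned = [
        " ".join(str(part).split())  # collapse repeated whitespace
        for part in parts
        if part is not None and str(part).strip()
    ]
    return ", ".join(cleaned)


print(build_address("  9626  Telstar Avenue ", "El Monte", None, "USA"))
```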

### How should I format a U.S. address on a numbered highway for geocoding?
The Google Maps Platform geocoder requires that U.S. numbered highways be specified in addresses as follows:

County Roads: "Co Road NNN" where NNN is the road number, e.g. "Co Road 82"  
State Highways: "State NNN" where State is the full name of the state and NNN is the highway number, e.g. "California 82"  
U.S. Highways: "U.S. NNN" where NNN is the highway number, e.g. "U.S. 101"  
U.S. Interstates: "Interstate NNN" where NNN is the interstate number, e.g. "Interstate 280"  
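
The four highway formats can be captured in a small helper. This is an illustrative sketch only: `format_us_highway` and the `kind` labels are hypothetical, and detecting which kind of highway an uploaded address refers to is left out.

```python
def format_us_highway(kind, number, state=None):
    """Format a U.S. numbered highway the way the geocoder expects it."""
    if kind == "county":
        return f"Co Road {number}"
    if kind == "state":
        if not state:
            raise ValueError("state highways need the full state name")
        return f"{state} {number}"
    if kind == "us":
        return f"U.S. {number}"
    if kind == "interstate":
        return f"Interstate {number}"
    raise ValueError(f"unknown highway kind: {kind!r}")


print(format_us_highway("state", 82, state="California"))
```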


## OpenStreetMap (Nominatim)
### API limits

https://operations.osmfoundation.org/policies/nominatim/

Requests per second: 1 (maximum)
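
A limit of one request per second means every call has to wait until at least one second has passed since the previous one. The class below is a standard-library sketch of that idea; `Throttle` is a hypothetical name, and the injectable `clock` and `sleep` arguments exist only to make it testable. In this notebook the same effect is obtained with geopy's `RateLimiter` and `min_delay_seconds=1`.

```python
import time


class Throttle:
    """Wrap a function so that consecutive calls are at least min_delay_seconds apart."""

    def __init__(self, func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.func = func
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock
        self.sleep = sleep
        self._last_call = None  # time of the previous call, None before the first call

    def __call__(self, *args, **kwargs):
        if self._last_call is not None:
            wait = self.min_delay_seconds - (self.clock() - self._last_call)
            if wait > 0:
                self.sleep(wait)
        self._last_call = self.clock()
        return self.func(*args, **kwargs)
```

At this rate, an upload of 1,000 addresses takes at least about 17 minutes, which is worth weighing against the "fast" criterion.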

## Carto
todo

## Mapbox
todo

## ESRI
todo

## Many alternatives exist
Exploring them would probably pay off, but it is not a priority right now.


In [2]:
import geopy
import getpass
import pandas as pd
from tqdm import tqdm
from urllib.request import urlopen

In [3]:
tqdm.pandas()
from geopy.geocoders import GoogleV3, Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [4]:
api_key = getpass.getpass()

 ·······································


In [5]:
sample_address_URL = "https://gist.githubusercontent.com/rutgerhofste/2cd0d4868d25a9b3518b359d6e9f56ae/raw/c4c6db100a2caf1615baf07f280b0dbdc508114e/sample_addresses"
df = pd.read_csv(sample_address_URL)
df = df[["Street Address", 'Country']]

In [6]:
df.iloc[:50].to_csv('sample_address.csv')

In [7]:
df["address"] = df["Street Address"] + ", " + df["Country"]
df.head()

Unnamed: 0,Street Address,Country,address
0,"B/10,Nandkishore industrial Estate, off Mahaka...",India,"B/10,Nandkishore industrial Estate, off Mahaka..."
1,9626 Telstar Avenue,USA,"9626 Telstar Avenue, USA"
2,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-",Spain,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-, S..."
3,URB Quinta Tristan Z4-11 - Jose Bustamante,Peru,"URB Quinta Tristan Z4-11 - Jose Bustamante, Peru"
4,72 SHANKBRIDGE ROAD,N.IRELAND,"72 SHANKBRIDGE ROAD, N.IRELAND"


In [8]:
rate_limit = 5000 / (100 + 1)  # 5000 requests per 100 seconds, plus a 1-second buffer (about 49.5 requests/s)

In [9]:
g = GoogleV3(api_key=api_key)
geolocator = RateLimiter(g.geocode,
                         min_delay_seconds=0.0, 
                         #min_delay_seconds=1/rate_limit, 
                         max_retries=5)

In [10]:
def get_lat(geopy_location):
    # geocode() returns None when there is no match
    try:
        lat = geopy_location.latitude
    except AttributeError:
        lat = None
    return lat

def get_lon(geopy_location):
    try:
        lon = geopy_location.longitude
    except AttributeError:
        lon = None
    return lon

def get_raw(geopy_location):
    # Despite the name, this returns a match flag (True/False), not the raw JSON
    try:
        geopy_location.raw
        has_match = True
    except AttributeError:
        has_match = False
    return has_match

In [11]:
df['geolocate'] = df['address'].progress_apply(geolocator)
df["lat"] = df['geolocate'].apply(get_lat)
df["lng"] = df['geolocate'].apply(get_lon)
df["Match"] = df['geolocate'].apply(get_raw)

100%|██████████| 53/53 [00:27<00:00,  1.95it/s]


In [12]:
df.head(10)

Unnamed: 0,Street Address,Country,address,geolocate,lat,lng,Match
0,"B/10,Nandkishore industrial Estate, off Mahaka...",India,"B/10,Nandkishore industrial Estate, off Mahaka...","(Unit 8-9-10-11-12, Ground Floor, B Wing, Clas...",19.119586,72.861379,True
1,9626 Telstar Avenue,USA,"9626 Telstar Avenue, USA","(9626 Telstar Ave, El Monte, CA 91731, USA, (3...",34.068835,-118.060331,True
2,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-",Spain,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-, S...","(Camí Molí de l'Amat, 0, 08208 Sabadell, Barce...",41.566764,2.108632,True
3,URB Quinta Tristan Z4-11 - Jose Bustamante,Peru,"URB Quinta Tristan Z4-11 - Jose Bustamante, Peru","(Urb. Quinta Tristan, José Luis Bustamante y R...",-16.424644,-71.530683,True
4,72 SHANKBRIDGE ROAD,N.IRELAND,"72 SHANKBRIDGE ROAD, N.IRELAND","(72 Shankbridge Rd, Kells, Ballymena BT42 3DL,...",54.814395,-6.246017,True
5,1460-1514 Jin Long Da Dao,China,"1460-1514 Jin Long Da Dao, China",,,,False
6,HACI SABANCI ORGAN?ZE SANAY?? BÖLGES? TURGUT Ö...,Turkey,HACI SABANCI ORGAN?ZE SANAY?? BÖLGES? TURGUT Ö...,"(Cihadiye, Turgut Özal Blv. No:2, 01790 Acıder...",36.983129,35.596071,True
7,VIA FLAMINIA NORD 48 CAGLI,ITALY,"VIA FLAMINIA NORD 48 CAGLI, ITALY","(Via Flaminia Nord, 48, 61043 Cagli PU, Italy,...",43.581691,12.670974,True
8,"Xia Keng Industrial Zone, Xia Keng, Chang Ping",China,"Xia Keng Industrial Zone, Xia Keng, Chang Ping...","(Changpingzhen, Dongguan, Guangdong, China, (2...",22.974855,113.993116,True
9,"No. 108, Shuichang Road 5, Shuikou Industrial ...",China,"No. 108, Shuichang Road 5, Shuikou Industrial ...",,,,False


In [13]:
df[df['Match'] == False]

Unnamed: 0,Street Address,Country,address,geolocate,lat,lng,Match
5,1460-1514 Jin Long Da Dao,China,"1460-1514 Jin Long Da Dao, China",,,,False
9,"No. 108, Shuichang Road 5, Shuikou Industrial ...",China,"No. 108, Shuichang Road 5, Shuikou Industrial ...",,,,False
16,HUANGNING ROAD 29 YUANGHUANG ECONOMIC DEVELOPM...,China,HUANGNING ROAD 29 YUANGHUANG ECONOMIC DEVELOPM...,,,,False
18,605 RED FEND ROAD,China,"605 RED FEND ROAD, China",,,,False
24,No. 218 Xigan Road,China,"No. 218 Xigan Road, China",,,,False
31,No.388 Shitai road,China,"No.388 Shitai road, China",,,,False
43,"NO.1 Guandong Road,Chengbei Economic Develop ...",China,"NO.1 Guandong Road,Chengbei Economic Develop ...",,,,False
51,this is bullshit,wnvcekrjvnekj,"this is bullshit, wnvcekrjvnekj",,,,False
52,f3rbtnb,ervtrhbtvr,"f3rbtnb, ervtrhbtvr",,,,False
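
The `Match` column only records whether anything came back, while the criteria also ask for a sense of match quality. Google's geocoding results carry a `geometry.location_type` field (`ROOFTOP`, `RANGE_INTERPOLATED`, `GEOMETRIC_CENTER`, or `APPROXIMATE`) and, for inexact matches, a `partial_match` flag. The sketch below reads both from a result dictionary such as the `raw` attribute of a geopy location; `match_quality` is a hypothetical helper and the sample dictionary is hand-written for illustration, not real output.

```python
def match_quality(raw):
    """Summarise the quality of one geocoding result (raw is None when there is no match)."""
    if not raw:
        return {"match": False, "location_type": None, "partial_match": None}
    return {
        "match": True,
        # ROOFTOP is the most precise value, APPROXIMATE the least
        "location_type": raw.get("geometry", {}).get("location_type"),
        # The flag is only present in the response when the match is partial
        "partial_match": bool(raw.get("partial_match", False)),
    }


# Hand-written example input, not an actual API response
sample = {"geometry": {"location_type": "ROOFTOP"}}
print(match_quality(sample))
```

With the dataframe above, this could be applied as `df['geolocate'].apply(lambda loc: match_quality(loc.raw if loc is not None else None))`.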


## Running in parallel

In [68]:
from multiprocessing import Pool
import time

In [69]:
sample_address_URL = "https://gist.githubusercontent.com/rutgerhofste/2cd0d4868d25a9b3518b359d6e9f56ae/raw/c4c6db100a2caf1615baf07f280b0dbdc508114e/sample_addresses"
df = pd.read_csv(sample_address_URL)
df = df[["Street Address", 'Country']]
df["address"] = df["Street Address"] + ", " + df["Country"]
df.head()

Unnamed: 0,Street Address,Country,address
0,"B/10,Nandkishore industrial Estate, off Mahaka...",India,"B/10,Nandkishore industrial Estate, off Mahaka..."
1,9626 Telstar Avenue,USA,"9626 Telstar Avenue, USA"
2,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-",Spain,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-, S..."
3,URB Quinta Tristan Z4-11 - Jose Bustamante,Peru,"URB Quinta Tristan Z4-11 - Jose Bustamante, Peru"
4,72 SHANKBRIDGE ROAD,N.IRELAND,"72 SHANKBRIDGE ROAD, N.IRELAND"


In [70]:
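# First attempt; superseded by the single-argument version in the next cell,
# because Pool.map passes only one argument to the worker function
def get_latlonraw(x, g):
    index, row = x
    time.sleep(0.05)
    address = g.geocode(row['address'])
    try:
        return address.latitude, address.longitude, True
    except AttributeError:
        # geocode() returned None: no match
        return None, None, False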


In [71]:
%%time
df1 = pd.DataFrame(0.0, index=list(range(0,len(df))), columns=list(['match_adress', 'lat','lon', 'match']))
df = pd.concat([df,df1], axis=1)

g = GoogleV3(api_key=api_key)

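def get_latlonraw(x):
    # x is one (index, row) pair from df.iterrows()
    index, row = x
    time.sleep(0.05)  # crude per-worker throttle
    address = g.geocode(row['address'])

    try:
        return address.address, address.latitude, address.longitude, True
    except AttributeError:
        # geocode() returned None: no match
        return None, None, None, False


# The context manager shuts the worker processes down once map() has returned
with Pool() as p:
    df[['match_adress', 'lat', 'lon', 'match']] = p.map(get_latlonraw, df.iterrows())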

CPU times: user 52.2 ms, sys: 51 ms, total: 103 ms
Wall time: 3.04 s


In [65]:
df.head(10)

Unnamed: 0,Street Address,Country,address,match_adress,lat,lon,match
0,"B/10,Nandkishore industrial Estate, off Mahaka...",India,"B/10,Nandkishore industrial Estate, off Mahaka...","Unit 8-9-10-11-12, Ground Floor, B Wing, Class...",19.119586,72.861379,True
1,9626 Telstar Avenue,USA,"9626 Telstar Avenue, USA","9626 Telstar Ave, El Monte, CA 91731, USA",34.068835,-118.060331,True
2,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-",Spain,"MOLI AMAT, S/N (RIU RIPOLL) 08208-SABADELL-, S...","Camí Molí de l'Amat, 0, 08208 Sabadell, Barcel...",41.566764,2.108632,True
3,URB Quinta Tristan Z4-11 - Jose Bustamante,Peru,"URB Quinta Tristan Z4-11 - Jose Bustamante, Peru","Urb. Quinta Tristan, José Luis Bustamante y Ri...",-16.424644,-71.530683,True
4,72 SHANKBRIDGE ROAD,N.IRELAND,"72 SHANKBRIDGE ROAD, N.IRELAND","72 Shankbridge Rd, Kells, Ballymena BT42 3DL, UK",54.814395,-6.246017,True
5,1460-1514 Jin Long Da Dao,China,"1460-1514 Jin Long Da Dao, China",,,,False
6,HACI SABANCI ORGAN?ZE SANAY?? BÖLGES? TURGUT Ö...,Turkey,HACI SABANCI ORGAN?ZE SANAY?? BÖLGES? TURGUT Ö...,"Cihadiye, Turgut Özal Blv. No:2, 01790 Acıdere...",36.983129,35.596071,True
7,VIA FLAMINIA NORD 48 CAGLI,ITALY,"VIA FLAMINIA NORD 48 CAGLI, ITALY","Via Flaminia Nord, 48, 61043 Cagli PU, Italy",43.581691,12.670974,True
8,"Xia Keng Industrial Zone, Xia Keng, Chang Ping",China,"Xia Keng Industrial Zone, Xia Keng, Chang Ping...","Changpingzhen, Dongguan, Guangdong, China",22.974855,113.993116,True
9,"No. 108, Shuichang Road 5, Shuikou Industrial ...",China,"No. 108, Shuichang Road 5, Shuikou Industrial ...",,,,False


In [126]:
df[df['match'] == False]

Unnamed: 0,Street Address,Country,address,lat,lon,match
9,"No. 108, Shuichang Road 5, Shuikou Industrial ...",China,"No. 108, Shuichang Road 5, Shuikou Industrial ...",,,False
18,605 RED FEND ROAD,China,"605 RED FEND ROAD, China",,,False
20,No. 7 Renmin Road Miaoqiao Twon,China,"No. 7 Renmin Road Miaoqiao Twon, China",,,False
24,No. 218 Xigan Road,China,"No. 218 Xigan Road, China",,,False
31,No.388 Shitai road,China,"No.388 Shitai road, China",,,False
48,"Qin Shi Industrial Street,Sanzao Scientific an...",China,"Qin Shi Industrial Street,Sanzao Scientific an...",,,False
50,"Gushu 1 Road, 93A",China,"Gushu 1 Road, 93A, China",,,False
51,this is bullshit,wnvcekrjvnekj,"this is bullshit, wnvcekrjvnekj",,,False
52,f3rbtnb,ervtrhbtvr,"f3rbtnb, ervtrhbtvr",,,False
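
Geocoding spends most of its time waiting on the network, so threads are a possible alternative to a process pool and avoid pickling the rows. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library; `geocode_all` is a hypothetical helper, the `geocode` argument is any callable that returns an object with `address`, `latitude`, and `longitude` (or `None`), and whether a shared geopy geocoder is safe to call from several threads is an assumption to verify. Catching every exception is a simplification: a production version should tell quota errors apart from genuine non-matches.

```python
from concurrent.futures import ThreadPoolExecutor


def geocode_all(addresses, geocode, max_workers=8):
    """Geocode addresses concurrently; results come back in input order."""

    def lookup(address):
        try:
            location = geocode(address)
        except Exception:
            # Simplification: any failure is reported as "no match"
            return None, None, None, False
        if location is None:
            return None, None, None, False
        return location.address, location.latitude, location.longitude, True

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup, addresses))
```

Usage would look like `results = geocode_all(df['address'], g.geocode)`, followed by assigning the tuples to the result columns as in the cell above. The number of workers multiplies the request rate, so it has to be chosen with the API quota in mind.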


## Sample address file

In [15]:
URL = "http://geocode.xyz/Example_set.txt"
df  = pd.read_csv(filepath_or_buffer=URL,
                    header=None,
                    sep='\t',
                    names=["address"],
                    encoding="UTF-8")

In [16]:
df.head()

Unnamed: 0,address
0,"62 Another Development - AD Sanchaung, Yangon ..."
1,"MOOSMATTENSTR. 25,79117 FREIBURG"
2,"19555 ,Shoemakersville, PA"
3,"Vintergatan 50, Skelleftehamn"
4,"110 W Tehachapi Blvd Tehachapi, CA 93561-1632,..."


In [17]:
df.iloc[90:100].reset_index(drop=True).to_csv('sample_address.csv')

In [220]:
df.iloc[90:100].reset_index(drop=True).to_excel('sample_address.xlsx')