# Toronto Neighbourhoods - geocodes
#### This is part of the Course [<u>*Applied Data Science Capstone*</u>](https://www.coursera.org/learn/applied-data-science-capstone/) on Coursera, to complete the Specialization <u>*IBM Data Science Professional Certificate*</u>

This exercise gets the geocodes for the Toronto neighbourhoods collected in the first [notebook](https://github.com/rareal/Coursera_Capstone/blob/master/Toronto_Neighborhoods.ipynb). We look up the latitude and longitude for each postcode in the dataframe.
The instructions suggest using the `geocoder` package, but it was not working for me. After looking around I found https://my.locationiq.com and got a developer token, which allows 10000 free calls a day. **LocationIQ** [api docs](https://locationiq.com/docs#forward-geocoding).


[Other googlemaps alternatives](http://geoawesomeness.com/google-maps-api-alternatives-best-cheap-affordable/).

---------------
Importing dependencies:

In [1]:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import time

In [34]:
# My LocationIQ API key is saved in a local file that is listed in .gitignore,
# so it does not get pushed to GitHub.
with open('locationiq_api_key') as locationiq_api_key:
    # strip() removes the trailing newline
    apikey = locationiq_api_key.read().strip()
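
A possible alternative to the key file, sketched here as an assumption rather than something this notebook does: read the key from an environment variable and fall back to the file only when the variable is unset. The variable name `LOCATIONIQ_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os
from pathlib import Path


def load_api_key(env_var="LOCATIONIQ_API_KEY", key_file="locationiq_api_key"):
    """Return the API key from the environment, falling back to a local file.

    Both the variable name and this helper are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # strip() drops the trailing newline and any stray whitespace
    return Path(key_file).read_text().strip()
```

With the file from the cell above in place, `apikey = load_api_key()` would give the same value as before.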

LocationIQ search api call example:

In [40]:
# Search / Forward Geocoding url
search_url = "https://us1.locationiq.com/v1/search.php"
data = {
    'key': apikey,
    'q': 'Empire State Building',
    'format': 'json'
}
response = requests.get(search_url, params=data)

In [41]:
print(response.json()[0]['display_name'])
print('latitude: ',response.json()[0]['lat'])
print('longitude: ',response.json()[0]['lon'])

Empire State Building, 350, 5th Avenue, Korea Town, Midtown South, Manhattan, Manhattan Community Board 5, New York County, New York City, New York, 10001, USA
latitude:  40.7484284
longitude:  -73.9856546198733


The API can also search by `postalcode` directly, which is more reliable than a free-text query.

Example: 

In [5]:
data = {'key': apikey,'postalcode':'M5K','countrycode':'CA','format': 'json'}
response = requests.get(search_url, params=data)
response.json()

[{'boundingbox': ['43.6469', '43.6469', '-79.3823', '-79.3823'],
  'class': 'place',
  'display_name': 'Downtown Toronto (Toronto Dominion Centre / Design Exchange), Toronto, Ontario, M5K, Canada',
  'importance': 0.1,
  'lat': '43.6469',
  'licence': '© LocationIQ.com CC BY 4.0, Data © OpenStreetMap contributors, ODbL 1.0',
  'lon': '-79.3823',
  'place_id': '75636',
  'type': 'postcode'}]
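
Note that `lat` and `lon` come back as strings, and that a failed lookup (seen later in this notebook) returns a dict with an `error` key instead of a list. A minimal sketch of a helper that handles both shapes is shown here; the name `first_lat_lon` is hypothetical, and the sample payloads are trimmed copies of the responses in this notebook.

```python
def first_lat_lon(payload):
    """Return (lat, lon) as floats from a parsed search response, or None."""
    # A successful search is a non-empty list of matches; an error is a dict.
    if not isinstance(payload, list) or not payload:
        return None
    best = payload[0]
    return float(best["lat"]), float(best["lon"])


ok = [{"lat": "43.6469", "lon": "-79.3823", "type": "postcode"}]
err = {"error": "Unable to geocode"}

print(first_lat_lon(ok))   # (43.6469, -79.3823)
print(first_lat_lon(err))  # None
```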

-----
#### Toronto Neighbourhoods - postcodes and geocodes
I exported `Toronto_Neighbourhoods.csv` from the first notebook, [Toronto_Neighborhoods.ipynb](https://github.com/rareal/Coursera_Capstone/blob/master/Toronto_Neighborhoods.ipynb).  
Importing it into a pandas DataFrame:

In [6]:
Toronto_neigh = pd.read_csv('Toronto_Neighbourhoods.csv',index_col=[0])

Now we need to get the geocode for each postcode in the dataframe. First, let's initialise two pandas Series to store the data, `lat` and `lon`, filled with the placeholder string `'None'`.

In [7]:
nrow = len(Toronto_neigh.Postcode)
lat = pd.Series(['None']*nrow)
lon = pd.Series(['None']*nrow)

Now we loop over the postcodes, get the geocode for each one and store it in the `lat` and `lon` variables.    
The API limit is 1 request per second, so the loop includes a `sleep`.  
An error handler keeps the loop going when a postcode is not found.

In [8]:
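# Quick check: end='\r' returns the cursor to the start of the line,
# so each print overwrites the previous one.
for i in range(10):
    print('car',i,end='\r')
    print('plane',i,end='\r')
    time.sleep(1)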

plane 9

In [9]:
for i in range(nrow):
    PC = Toronto_neigh.Postcode[i]
    data = {'key': apikey,'postalcode': PC,'countrycode':'CA','format': 'json'}
    try:
        response = requests.get(search_url, params=data)
        response_json = response.json()
        lat[i] = response_json[0]['lat']
        lon[i] = response_json[0]['lon']
    except Exception:
        # lookup failed (e.g. postcode not found): keep the 'None' placeholder
        pass
    # progress line, overwritten in place thanks to end='\r'
    print('i: {}, PC: {}, Lat: {}, Lon: {}'.format(i,PC,lat[i],lon[i]),end='\r')
    # pause to respect the limit of 1 request per second
    time.sleep(1)

i: 102, PC: M8Z, Lat: 43.6256, Lon: -79.5231
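
The loop above mixes the HTTP call, the throttling and the bookkeeping. A rough sketch of the same pattern with the fetch step passed in as a function, which makes it possible to try the logic without spending API calls, could look like this. The names `geocode_all` and `fetch` are hypothetical, and unlike the loop above this version only catches "no match" errors, so network failures would still raise.

```python
import time


def geocode_all(postcodes, fetch, delay=1.0):
    """Call fetch(postcode) for each postcode, sleeping `delay` seconds after each call.

    `fetch` is any callable returning a parsed search response. Postcodes
    whose response has no usable first match are stored as None.
    """
    results = {}
    for pc in postcodes:
        try:
            match = fetch(pc)[0]
            results[pc] = (float(match["lat"]), float(match["lon"]))
        except (KeyError, IndexError, TypeError, ValueError):
            # e.g. {'error': 'Unable to geocode'} or an empty list
            results[pc] = None
        time.sleep(delay)
    return results


# Offline check with canned responses instead of the real API
fake = {
    "M5K": [{"lat": "43.6469", "lon": "-79.3823"}],
    "M7R": {"error": "Unable to geocode"},
}
print(geocode_all(["M5K", "M7R"], fake.__getitem__, delay=0))
```

Against the real API, `fetch` could wrap the `requests.get(search_url, params=data).json()` call used above, with `delay` left at 1 second.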

In [10]:
pd.Series(lat!='None').value_counts()

True     102
False      1
dtype: int64

One postcode could not be found.

In [11]:
Toronto_neigh[lat=='None']

Unnamed: 0,Postcode,Borough,Neighbourhood
76,M7R,Mississauga,Canada Post Gateway Processing Centre


In [12]:
data = {'key': apikey,'postalcode':'{}'.format(Toronto_neigh.Postcode[76]),'countrycode':'CA','format': 'json'}
response = requests.get(search_url, params=data)
response.json()

{'error': 'Unable to geocode'}

Trying the search query method, using the Borough and Neighbourhood names:

In [13]:
data = {'key': apikey,'q':', '.join(Toronto_neigh.iloc[76,1:3].values),'format': 'json'}
response = requests.get(search_url, params=data)
res = response.json()

In [14]:
# extracting the lat and lon
print('matches:',len(res))
for item in res:
    try:
        print('lat: {}, lon: {}, postcode: {}'.format(item['lat'],item['lon'],item['postcode']))
    except Exception:
        print('lat: {}, lon: {}, no postcode'.format(item['lat'],item['lon']))

matches: 10
lat: 43.596832, lon: -79.623997, no postcode
lat: 43.570452, lon: -79.626636, no postcode
lat: 43.569545, lon: -79.59661, no postcode
lat: 43.716528, lon: -79.637611, no postcode
lat: 43.66198, lon: -79.665466, no postcode
lat: 43.644478, lon: -79.708221, no postcode
lat: 43.639839, lon: -79.713425, no postcode
lat: 43.625576, lon: -79.676659, no postcode
lat: 43.649548, lon: -79.666832, no postcode
lat: 43.654701, lon: -79.665771, no postcode


In [15]:
# average
print('lat: ',pd.Series([x['lat'] for x in res]).astype(float).mean())
print('lon: ',pd.Series([x['lon'] for x in res]).astype(float).mean())

lat:  43.63294789999999
lon:  -79.6581228


The assignment also provides the geocodes at http://cocl.us/Geospatial_data, in the file `Geospatial_Coordinates.csv`. Importing the file to check the coordinates expected for `M7R`:

In [16]:
ref_codes = pd.read_csv('Geospatial_Coordinates.csv')
ref_codes.head(3)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711


In [17]:
ref_codes[ref_codes['Postal Code']=='M7R']

Unnamed: 0,Postal Code,Latitude,Longitude
86,M7R,43.636966,-79.615819


This is close to the average I got from the address search:

geocodes|mine|reference
---|---|---
lat|43.632948|43.636966
lon|-79.658123|-79.615819
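
To put the gap between the two points into distance terms, one option is the haversine formula, which gives the great-circle distance between two latitude/longitude pairs. This is a sketch that assumes a spherical Earth with a mean radius of 6371 km; the function name `haversine_km` is hypothetical.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


mine = (43.632948, -79.658123)       # average of the address-search matches
reference = (43.636966, -79.615819)  # value from Geospatial_Coordinates.csv
print(round(haversine_km(*mine, *reference), 1), "km")
```

For these two points the distance comes out at about 3.4 km, almost all of it from the longitude difference.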

But I could not find a postcode to confirm it. LocationIQ also has a reverse search, which returns an address for a pair of coordinates. I'll run that with the coordinates from the reference file.

In [18]:
# Reverse Geocoding method url
reverse_url = "https://us1.locationiq.com/v1/reverse.php"
# M7R
latt=43.6369656
long=-79.615819

data = {'key': apikey,'lat': latt,'lon': long,'format': 'json'}
response = requests.get(reverse_url, params=data)

In [19]:
response.json()['address']['postcode']

'L4W 5G6'

Maybe this is because the code in question belongs to the `Canada Post Gateway Processing Centre`, so it may be a special code used only for that place.   
In any case, I'm updating my table with the reference geocodes for that postcode.

In [20]:
lat[76]=latt
lon[76]=long
print(lat[76],lon[76])

43.6369656 -79.615819


Joining in a DataFrame:

In [21]:
Toronto_neigh['Latitude']= lat.astype(float)
Toronto_neigh['Longitude']= lon.astype(float)
Toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.6555,-79.3626
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504
4,M7A,Queen's Park,Queen's Park,43.6641,-79.3889


Now I will compare the geocodes I got with the ones provided, to see if they match.

In [22]:
# test whether both datasets contain the same postal codes
sorted(Toronto_neigh['Postcode'].values) == sorted(ref_codes['Postal Code'].values)

True

In [23]:
# make a copy, order by the ref_codes Dataset
mine=Toronto_neigh.copy()
mine_ord = mine.set_index('Postcode').loc[ref_codes['Postal Code'].values].reset_index()

In [24]:
# latitude diff
latdiff = abs(ref_codes.Latitude.values - mine_ord.Latitude.values)

# Longitude diff
londiff = abs(ref_codes.Longitude.values - mine_ord.Longitude.values)

In [25]:
pd.DataFrame({'latdiff':latdiff,'londiff':londiff}).describe().applymap(lambda x: format(x,'.3f'))

Unnamed: 0,latdiff,londiff
count,103.0,103.0
mean,0.003,0.003
std,0.012,0.007
min,0.0,0.0
25%,0.001,0.001
50%,0.002,0.002
75%,0.003,0.004
max,0.118,0.071


The median difference is 0.002 degrees for both latitude and longitude, and the largest is 0.118 degrees, so my dataset is close enough to the reference to go ahead and use it.

In [26]:
# rename first column and save to a csv
Toronto_neigh_f = Toronto_neigh.rename(columns={'Postcode':'PostalCode'})
Toronto_neigh_f.to_csv('Toronto_neigh_latlon.csv')

#### Final Dataset for this part

In [27]:
Toronto_neigh_f

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.6555,-79.3626
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504
4,M7A,Queen's Park,Queen's Park,43.6641,-79.3889
5,M9A,Etobicoke,Islington Avenue,43.6662,-79.5282
6,M1B,Scarborough,"Rouge, Malvern",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.7063,-79.3094
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3783
