# Getting the coordinates

In the previous [notebook](scraper.ipynb) we scraped Toronto neighborhood data from [this](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) Wikipedia page.

The second step towards clustering the neighborhoods is to find the latitude and longitude of each neighborhood through the use of a geocoder.

### Importing libraries
In order to find the coordinates we will be using the <code>geopy</code> module.

In [1]:
import pandas as pd

from geopy.geocoders import Nominatim

### Loading the neighborhood dataset

We then import the dataset we scraped previously.

In [2]:
df = pd.read_csv('data/toronto_neighborhoods_raw.csv')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


### Finding the coordinates

Let's start finding the coordinates. We will be using <code>Nominatim</code>, which is a geocoder for <b>OpenStreetMap</b> data.

In [None]:
locator = Nominatim(user_agent="toronto_neighborhood_explorer")

postal_codes = df['PostalCode']
postal_codes_dict = dict()

for postal_code in postal_codes:
    search_str = '{}, Toronto, Ontario'.format(postal_code)
    try:
        location = locator.geocode(search_str)
    except Exception:
        # Skip this postal code if the request itself fails (e.g. a timeout)
        continue
    # geocode() returns None when no match is found
    if location is not None:
        postal_codes_dict[postal_code] = (location.latitude, location.longitude)

print('Number of postal codes in data:', len(postal_codes))
print('Number of coordinates found:', len(postal_codes_dict))
print('Missing coordinates:', len(postal_codes) - len(postal_codes_dict))

The geocoder was only able to find 24 of the coordinates we needed. As one of the few free geocoding options, <code>Nominatim</code> does not offer the coverage of some paid providers.

A *.csv* file with the neighborhood coordinates was provided in the project assignment, so we load it and then merge the coordinates into our <code>DataFrame</code>.
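To see which postal codes were missed rather than only how many, the lookup loop can be written as a small helper. The following is a minimal sketch, not the code used in this notebook: the function name `geocode_postal_codes` and its `delay_seconds` parameter are hypothetical, and the geocoder is passed in as a plain callable so the helper can be tried without network access. The one-second default pause reflects Nominatim's public usage policy of at most one request per second.

```python
import time


def geocode_postal_codes(postal_codes, geocode, delay_seconds=1.0):
    """Look up each postal code and return (found, missing).

    `geocode` is any callable that takes a query string and returns an
    object with `latitude` and `longitude` attributes, or None when
    nothing matches (this is how geopy's `Nominatim.geocode` behaves).
    """
    found = {}
    missing = []
    for i, postal_code in enumerate(postal_codes):
        if i > 0:
            # Pause between requests to stay within the service's rate limit
            time.sleep(delay_seconds)
        query = '{}, Toronto, Ontario'.format(postal_code)
        try:
            location = geocode(query)
        except Exception:
            # Treat a failed request the same as "no match"
            location = None
        if location is None:
            missing.append(postal_code)
        else:
            found[postal_code] = (location.latitude, location.longitude)
    return found, missing
```

With the objects defined in this notebook, a call might look like `found, missing = geocode_postal_codes(df['PostalCode'], locator.geocode)`, after which `missing` lists the postal codes that still need coordinates from another source.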

### Loading the coordinates

In [9]:
coord = pd.read_csv('data/geospatial_coordinates.csv', header=0, names=['PostalCode', 'Latitude', 'Longitude'])
coord

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.8066863,-79.1943534
1,M1C,43.7845351,-79.1604971
2,M1E,43.7635726,-79.1887115
3,M1G,43.7709921,-79.2169174
...,...,...,...
98,M9N,43.706876,-79.5181884
99,M9P,43.696319,-79.5322424
100,M9R,43.6889054,-79.5547244
101,M9V,43.7394164,-79.5884369


### Merging neighborhoods and coordinates

In [10]:
merged = pd.merge(df, coord, on='PostalCode')
merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7532586,-79.3296565
1,M4A,North York,Victoria Village,43.7258823,-79.3155716
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6542599,-79.3606359
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.4647633
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623015,-79.3894938
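By default `pd.merge` performs an inner join, so any neighborhood whose postal code is absent from the coordinates file would be dropped silently. One way to check for this is a left join with `indicator=True`. The sketch below uses small toy frames as stand-ins for `df` and `coord`; the postal code `M0X` and its neighborhood name are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `coord` frames.
# 'M0X' is a hypothetical postal code with no coordinates.
neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Hypothetical'],
})
coordinates = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.7532586, 43.7258823],
    'Longitude': [-79.3296565, -79.3155716],
})

# A left join keeps every neighborhood row; `indicator=True` adds a `_merge`
# column recording whether a matching coordinate row was found.
checked = pd.merge(neighborhoods, coordinates, on='PostalCode',
                   how='left', indicator=True)

# Rows marked 'left_only' have no coordinates.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

If the list of unmatched postal codes is empty for the real data, the inner join used above loses no neighborhoods.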


### Saving the dataset

In [11]:
merged.to_csv('data/toronto_neighborhoods.csv', index=False)

### Conclusion

After retrieving the coordinates and adding them to the neighborhood dataframe, we are now ready for segmenting and clustering. This is done in [clustering.ipynb](clustering.ipynb).