## Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd

The first task is to parse the data from Wikipedia:

In [2]:
from IPython.display import IFrame
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
IFrame(url, width=800, height=350)

The data could be parsed with `BeautifulSoup`, but it is more straightforward to use `pandas` and its `read_html` function:

In [3]:
data, = pd.read_html(url, match="Postal Code", skiprows=1)
data.columns = ["PostalCode", "Borough", "Neighborhood"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M2A,Not assigned,Not assigned
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,"Regent Park, Harbourfront"
4,M6A,North York,"Lawrence Manor, Lawrence Heights"


Only process the cells that have an assigned borough. Ignore cells with a borough that is "Not assigned".

In [4]:
data = data[data["Borough"] != "Not assigned"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,"Regent Park, Harbourfront"
4,M6A,North York,"Lawrence Manor, Lawrence Heights"
5,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

The solution is to group the data by `PostalCode` and aggregate the remaining columns. For the borough, it is sufficient to pick the first item of each group's series; for the neighborhood, the items are joined together using `", ".join(s)`:

In [5]:
borough_func = lambda s: s.iloc[0]
neighborhood_func = lambda s: ", ".join(s)
agg_funcs = {"Borough": borough_func, "Neighborhood": neighborhood_func}
data_temp = data.groupby(by="PostalCode").aggregate(agg_funcs)
data_temp.head()

PostalCode,Borough,Neighborhood
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


Some postprocessing is needed: reset the index and put the columns back in their original order:

In [6]:
data = data_temp.reset_index()[data.columns]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
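
The grouping and index reset above can also be sketched in a single step. The toy frame below is hypothetical; it only mimics the duplicated `M5A` rows described earlier. Passing `as_index=False` is an assumption about preference, not the notebook's method: it keeps `PostalCode` as an ordinary column, so no separate `reset_index` call is needed.

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice, as in the Wikipedia table
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# "first" keeps the first borough of each group;
# ", ".join concatenates the neighborhood names of each group
combined = toy.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
print(combined)
```

With `as_index=False` the result already has the columns `PostalCode`, `Borough`, `Neighborhood` in that order, one row per postal code.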


If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [7]:
data[data["Neighborhood"] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood


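We can, for example, iterate through the table and replace the values. The row yielded by `iterrows` may be a copy, so assigning to it is not guaranteed to change `data`; the value is therefore written back through `data.at`:

In [8]:
for (j, row) in data.iterrows():
    if row["Neighborhood"] == "Not assigned":
        borough = row["Borough"]
        print("Replace \"Not assigned\" => %s in row %i" % (borough, j))
        # Write to the dataframe itself rather than to the (possibly copied) row
        data.at[j, "Neighborhood"] = borough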
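
A loop-free variant of the replacement above can be sketched with a boolean mask. The toy frame is hypothetical and only reproduces the Queen's Park case from the assignment text; the `toy` and `mask` names are illustrative.

```python
import pandas as pd

# Hypothetical toy data: one row with a borough but no assigned neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Regent Park, Harbourfront"],
})

# Select the rows whose neighborhood is missing ...
mask = toy["Neighborhood"] == "Not assigned"
# ... and copy the borough name into exactly those rows
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Because the assignment goes through `.loc` on the dataframe itself, the change is applied in place and rows that already have a neighborhood are left untouched.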

The query above returned no rows, so nothing needs replacing in the current table. To check the data, examine row 85 (`M7A`), the Queen's Park cell mentioned above:

In [9]:
data.iloc[83:88]

Unnamed: 0,PostalCode,Borough,Neighborhood
83,M6R,West Toronto,"Parkdale, Roncesvalles"
84,M6S,West Toronto,"Runnymede, Swansea"
85,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."


The size of the data is now:

In [10]:
data.shape

(103, 3)

## Determining coordinates for each neighborhood

In [11]:
# import sys
# !conda install -c conda-forge geopy --yes --prefix {sys.prefix}
# !conda install -c conda-forge folium=0.5.0 --yes --prefix {sys.prefix}
# !conda install -c conda-forge geocoder --yes --prefix {sys.prefix}

Using `geocoder` with the Google service results in `OVER_QUERY_LIMIT`: "Keyless access to Google Maps Platform is deprecated. Please use an API key with all your API calls to avoid service interruption. For further details please refer to http://g.co/dev/maps-no-account." It seems to be quite hard to fetch location data from the internet without API keys, so this time the CSV file approach is used instead:

In [12]:
locations = pd.read_csv("https://cocl.us/Geospatial_data")
locations.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
locations.columns = ["PostalCode", "Latitude", "Longitude"]
neighborhoods = pd.merge(data, locations, on='PostalCode')
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


To check that the merge was successful, look up the first postal code, `M5G`, which should have the coordinates `(43.657952, -79.387383)`:

In [14]:
neighborhoods[neighborhoods["PostalCode"] == "M5G"]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [15]:
neighborhoods[neighborhoods["PostalCode"] == "M5A"]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
53,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636


The ordering of the dataframe in the assignment is unknown, but we clearly now have the correct latitude and longitude attached to each postal code.
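
Spot checks like the two above can be complemented by a check over the whole table. The sketch below uses hypothetical toy frames (`toy_data`, `toy_locations`, and the made-up code `M9Z`) and assumes a left merge with `indicator=True`, which marks rows that found no coordinates; `validate="one_to_one"` makes pandas raise an error if a postal code is duplicated on either side.

```python
import pandas as pd

# Hypothetical toy data; M9Z is a made-up code with no coordinates
toy_data = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
toy_locations = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Left merge keeps every row of toy_data; the "_merge" column
# records whether a matching row was found in toy_locations
checked = pd.merge(
    toy_data, toy_locations,
    on="PostalCode", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that did not receive coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.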

In [16]:
# create map
import folium
from geopy.geocoders import Nominatim

address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
