# Segmenting and Clustering Toronto Neighborhoods

## Scrape Wikipedia

In [16]:
import pandas as pd

wiki_link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Pandas provides a function, `pd.read_html`, for reading HTML tables directly into DataFrames.

In [17]:
# read_html returns a list of DataFrames, one per table found on the page
dfs = pd.read_html(wiki_link)
# The first DataFrame contains the borough data
df = dfs[0]

## Scrub the Data

Now that the data is obtained, we must clean it up a bit. 

In [45]:
# The DF will consist of three columns: PostalCode, Borough and Neighborhood
df = df.rename(columns={'Postal code':'PostalCode'})
df.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

We are tasked with removing any rows whose borough is 'Not assigned'. This will also remove any rows that may have had a value of 'Not assigned' in the Neighborhood column.

In [49]:
# Ignore cells that have a borough 'Not assigned'
# This also captures empty Neighborhood fields
df = df.drop(labels=df.loc[df.Borough == 'Not assigned'].index)
# Reset the index
df.reset_index(drop=True, inplace=True)
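
The same filter can also be written as a boolean mask. The sketch below shows this on a toy DataFrame; the rows are illustrative values chosen for the example, not data read from the Wikipedia table.

```python
import pandas as pd

# Toy rows for illustration only (not the scraped table).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M2A", "M4A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
kept = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept)
```

Chaining `reset_index(drop=True)` avoids `inplace=True` and discards the old index instead of keeping it as a column.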

Replace each ' / ' separator with ', ' to match the formatting of the provided example.

In [50]:
# Use commas instead of slashes for postal codes that cover multiple
# neighborhoods
df.Neighborhood = df.Neighborhood.apply(lambda x: x.replace(' / ', ', '))

# Check the result against the example from the prompt
df.loc[df.PostalCode == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
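
As an alternative to `apply` with a lambda, pandas offers a vectorized string accessor. A minimal sketch on a two-element Series follows; the Series itself is built by hand for illustration.

```python
import pandas as pd

# Hand-built example values; the first mimics a slash-separated entry.
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods"])

# regex=False treats ' / ' as a literal substring rather than a pattern.
cleaned = names.str.replace(" / ", ", ", regex=False)
print(cleaned.tolist())
```

Unlike the lambda version, `.str.replace` passes missing values through as missing instead of raising an error.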


In [51]:
# Print the number of rows in our DataFrame
df.shape

(103, 3)

# Obtaining Coordinate Data

In [53]:
import geocoder

We will use the ArcGIS provider, as it seems to be the most reliable.

In [69]:
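def get_lat_lon(p_code):
    """Return [latitude, longitude] for a postal code using ArcGIS."""
    print('Trying to get coordinates for {}'.format(p_code))
    geo_str = '{}, Toronto'.format(p_code)
    lat_lon = None
    # Retry until the geocoder returns a result. Note that this loop has
    # no upper bound, so it never exits if the lookup keeps failing.
    while not lat_lon:
        g = geocoder.arcgis(geo_str)
        lat_lon = g.latlng
    print('Successfully got coordinates for {}'.format(p_code))
    return lat_lon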
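
The loop in `get_lat_lon` retries until ArcGIS returns a result, so it never terminates if a lookup keeps failing. A minimal sketch of a bounded alternative follows. The names `lookup_with_retries`, `max_tries`, `delay`, and the fake `flaky_lookup` are hypothetical and introduced only for this example; they are not part of `geocoder`.

```python
import time

def lookup_with_retries(lookup, query, max_tries=5, delay=0.0):
    """Call lookup(query) until it returns a truthy value or tries run out.

    `lookup` is any callable that returns a [lat, lon] pair or None
    (a hypothetical stand-in for a geocoder call).
    """
    for _ in range(max_tries):
        result = lookup(query)
        if result:
            return result
        time.sleep(delay)  # pause before the next attempt
    return None  # give up instead of looping forever

# Fake lookup that fails twice and then succeeds (illustrative coordinates).
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return [43.65, -79.38] if calls["n"] >= 3 else None

coords = lookup_with_retries(flaky_lookup, "M5A, Toronto")
print(coords, calls["n"])
```

With `geocoder`, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`. Callers then need to handle a `None` result before indexing into it.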

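Use the `applymap` function to send each postal code to the `get_lat_lon` function. (In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`.) Please be patient: this step makes one geocoding request per postal code and may take well over a minute.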

In [70]:
lat_lon_df = df[['PostalCode']].applymap(get_lat_lon)

Trying to get coordinates for M3A
Successfully got coordinates for M3A
Trying to get coordinates for M4A
Successfully got coordinates for M4A
Trying to get coordinates for M5A
Successfully got coordinates for M5A
Trying to get coordinates for M6A
Successfully got coordinates for M6A
Trying to get coordinates for M7A
Successfully got coordinates for M7A
Trying to get coordinates for M9A
Successfully got coordinates for M9A
Trying to get coordinates for M1B
Successfully got coordinates for M1B
Trying to get coordinates for M3B
Successfully got coordinates for M3B
Trying to get coordinates for M4B
Successfully got coordinates for M4B
Trying to get coordinates for M5B
Successfully got coordinates for M5B
Trying to get coordinates for M6B
Successfully got coordinates for M6B
Trying to get coordinates for M9B
Successfully got coordinates for M9B
Trying to get coordinates for M1C
Successfully got coordinates for M1C
Trying to get coordinates for M3C
Successfully got coordinates for M3C
Trying to get coordinates for M4C
Successfully got coordinates for M4C
Trying to get coordinates for M5C
Successfully got coordinates for M5C
Trying to get coordinates for M6C
Successfully got coordinates for M6C
Trying to get coordinates for M9C
Successfully got coordinates for M9C
Trying to get coordinates for M1E
Successfully got coordinates for M1E
Trying to get coordinates for M4E
Successfully got coordinates for M4E
Trying to get coordinates for M5E
Successfully got coordinates for M5E
Trying to get coordinates for M6E
Successfully got coordinates for M6E
Trying to get coordinates for M1G
Successfully got coordinates for M1G
Trying to get coordinates for M4G
Successfully got coordinates for M4G
Trying to get coordinates for M5G
Successfully got coordinates for M5G
Trying to get coordinates for M6G
Successfully got coordinates for M6G
Trying to get coordinates for M1H
...

We will now insert the latitude and longitude values into the original DataFrame.

In [84]:
df['Latitude'] = lat_lon_df.PostalCode.map(lambda x: x[0])
df['Longitude'] = lat_lon_df.PostalCode.map(lambda x: x[1])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939
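
Another way to split a column of `[lat, lon]` pairs is to expand the pairs into a two-column DataFrame in one step. The sketch below assumes each entry is a two-element list; the two sample pairs are copied from the first rows of the output above.

```python
import pandas as pd

# Two [lat, lon] pairs taken from the output above, keyed by postal code.
pairs = pd.Series(
    [[43.752935, -79.335641], [43.728102, -79.31189]],
    index=["M3A", "M4A"],
)

# Expand each pair into separate Latitude and Longitude columns.
coords = pd.DataFrame(
    pairs.tolist(), index=pairs.index, columns=["Latitude", "Longitude"]
)
print(coords)
```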


To verify that this worked correctly, let's check whether our data matches the provided example. There are slight differences, but these are presumably caused by using different geocoding providers (Google vs. ArcGIS).

In [86]:
df.loc[df.PostalCode.isin(['M5G', 'M2H', 'M4B', 'M1J'])]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529
24,M5G,Downtown Toronto,Central Bay Street,43.656072,-79.385653
27,M2H,North York,Hillcrest Village,43.802556,-79.356566
32,M1J,Scarborough,Scarborough Village,43.744203,-79.228725
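
One way to put a number on such "slight differences" is the great-circle distance between two coordinate pairs. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. The first point is the M5G row from the output above; the second point is a hypothetical value invented for illustration, not the actual figure from the provided example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

arcgis_m5g = (43.656072, -79.385653)  # M5G row from the output above
other_m5g = (43.6579, -79.3873)       # hypothetical second-provider value

distance = haversine_km(*arcgis_m5g, *other_m5g)
print(round(distance, 3), "km")
```

A distance well under a kilometre would be consistent with two providers placing the same postal code at slightly different points.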


# Exploring and Clustering Toronto Neighborhoods