# Part 1: Getting data from Wikipedia, cleaning and formatting

First, we import Pandas and load the tables from the Wikipedia page. The table we want is the first one on the page.

In [1]:
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
df = tables[0]  # read_html already returns a list of DataFrames

Next we (1) rename the columns, (2) drop the rows where Borough is "Not assigned", and (3) on rows where Neighborhood is "Not assigned", set it to the same value as Borough.

In [2]:
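df.columns = ["PostalCode", "Borough", "Neighborhood"]
df.drop(df[df["Borough"] == "Not assigned"].index, inplace=True)
# Overwrite only the Neighborhood column (not the whole row) with the Borough value.
unassigned = df["Neighborhood"] == "Not assigned"
df.loc[unassigned, "Neighborhood"] = df.loc[unassigned, "Borough"]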
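
To make the cleaning steps easier to follow, here is a minimal sketch on a small made-up table. The rows and the name `toy` are illustrative assumptions, not data from the Wikipedia page.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Wikipedia table.
toy = pd.DataFrame(
    [
        ["X1A", "Not assigned", "Not assigned"],
        ["X2A", "Alpha", "Not assigned"],
        ["X3A", "Beta", "Gamma"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep only the rows whose Borough is assigned.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# For unassigned neighborhoods, copy the Borough value into that column only.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```

Selecting the column explicitly in `.loc[missing, "Neighborhood"]` matters: without it, the assignment targets every column of the matching rows.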

Now we group the dataframe so that each postal code has a single row, with the neighborhoods under that postal code joined into a comma-separated list in the Neighborhood column.

In [3]:
nbrs = df.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(lambda hoods: ", ".join(hoods)).reset_index()
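
The effect of the grouping can be seen on a tiny example. This is a sketch with made-up values; the names `toy` and `grouped` are assumptions introduced only for illustration.

```python
import pandas as pd

# Made-up rows: one postal code with two neighborhoods, one with a single neighborhood.
toy = pd.DataFrame(
    {
        "PostalCode": ["X1A", "X1A", "X2B"],
        "Borough": ["Alpha", "Alpha", "Beta"],
        "Neighborhood": ["North", "South", "East"],
    }
)

# Join the neighborhoods of each (PostalCode, Borough) group into one string.
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)

print(grouped)
```

The two "X1A" rows collapse into one row whose Neighborhood is "North, South", while "X2B" keeps its single neighborhood unchanged.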

Finally, we check the shape of the resulting dataframe.

In [4]:
nbrs.shape

(103, 3)

# Part 2: Adding location data

First we install and import the Geocoder library and define a function to fetch coordinates for a given postal code.

In [23]:
!conda install -c conda-forge geocoder --yes
import geocoder

def lat_long(postal_code):
    coords = None
    # Keep asking until the geocoder returns a result.
    while coords is None:
        # Using ArcGIS instead of Google, because Google doesn't seem to work.
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        coords = g.latlng
    return coords[0], coords[1]


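The `lat_long` function retries until it gets a result, which would loop forever if the service never answered. A bounded variant might look like the sketch below. The helper `fetch_with_retries` and its parameters `max_attempts` and `delay` are hypothetical names, not part of Geocoder, and the stand-in fetch function returns made-up coordinates so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, delay=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        # Wait before the next try, but not after the final one.
        if attempt < max_attempts:
            time.sleep(delay)
    raise RuntimeError("no result after {} attempts".format(max_attempts))


# Stand-in for a geocoding call: fails twice, then returns made-up coordinates.
responses = iter([None, None, (1.0, 2.0)])
coords = fetch_with_retries(lambda: next(responses))
print(coords)
```

With Geocoder, the `fetch` argument could be something like `lambda: geocoder.arcgis(query).latlng`, where `query` is the address string built from the postal code.
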
Next, we go through the rows of the dataframe, get the latitude and longitude for each postal code, and add them to two separate lists. Finally, we insert the lists into the dataframe as new columns Latitude and Longitude.

In [28]:
lat_list = []
long_list = []
for counter, pc in enumerate(nbrs["PostalCode"], start=1):
    print(counter, end='\r')  # progress counter, overwritten in place
    lat, long = lat_long(pc)
    lat_list.append(lat)
    long_list.append(long)
print("Done.")
nbrs["Latitude"] = lat_list
nbrs["Longitude"] = long_list

Done.
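
As an alternative to appending to two lists inside the loop, the (latitude, longitude) pairs could be collected first and split afterwards. The sketch below uses made-up pairs in place of real `lat_long` results; the names `pairs`, `lats`, and `longs` are illustrative assumptions.

```python
# Made-up (latitude, longitude) pairs standing in for lat_long() results.
pairs = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]

# zip(*pairs) transposes the list of pairs into one tuple per coordinate.
lats, longs = zip(*pairs)

print(list(lats))
print(list(longs))
```

The resulting sequences have one entry per postal code, in the same order as the pairs, so they can be assigned to the Latitude and Longitude columns in the same way as the lists above.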


Let's check that it looks right:

In [25]:
nbrs.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.811525 | -79.195517 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.785665 | -79.158725 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.765815 | -79.175193 |
| 3 | M1G | Scarborough | Woburn | 43.768369 | -79.21759 |
| 4 | M1H | Scarborough | Cedarbrae | 43.769688 | -79.23944 |
