In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

First, I scraped the data from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and converted the table on that page into a pandas DataFrame.

In [2]:
# Get the source code from Wikipedia
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

# Find the table of neighbourhoods in Toronto
table = soup.table.text
# Convert the strings into a nested list
rows = [[r for r in row.split('\n') if r] for row in filter(None, table.split('\n\n'))]

# Convert the nested list into a pandas DataFrame
Toronto_neighborhood_raw = pd.DataFrame(rows[1:], columns=rows[0])
Toronto_neighborhood_raw.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
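
To make the nested list comprehension above easier to follow, here is a minimal sketch that applies the same splitting logic to a small hand-written string. The `sample_text` value is a made-up stand-in for `soup.table.text`, not the real page content: rows are assumed to be separated by blank lines and cells by single newlines.

```python
# Hypothetical stand-in for soup.table.text: blank lines separate rows,
# single newlines separate the cells within a row.
sample_text = (
    "\nPostcode\nBorough\nNeighbourhood\n\n"
    "M1A\nNot assigned\nNot assigned\n\n"
    "M3A\nNorth York\nParkwoods\n"
)

# Split into rows on blank lines, then split each row into cells,
# dropping the empty strings left by leading or trailing newlines.
sample_rows = [
    [cell for cell in row.split('\n') if cell]
    for row in filter(None, sample_text.split('\n\n'))
]

header, body = sample_rows[0], sample_rows[1:]
print(header)
print(body)
```

The first inner list becomes the column names and the remaining lists become the data rows, which is how `rows[0]` and `rows[1:]` are used when building the DataFrame.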


Rows whose Borough is 'Not assigned' are not useful, so I dropped them here.

In [3]:
# Drop all the rows with 'Not assigned' as the value of Borough
Toronto_neighborhood = Toronto_neighborhood_raw[Toronto_neighborhood_raw['Borough']!='Not assigned']
Toronto_neighborhood.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


We can see that several neighbourhoods share the same postcode. I grouped the rows by postcode and joined the neighbourhood names of each group into a single comma-separated string.

In [4]:
# Group the dataframe by Postcode, aggregate the neighbourhoods
Toronto_grouped = Toronto_neighborhood.groupby('Postcode')
Toronto_grouped = Toronto_grouped.agg({'Borough': np.unique, 'Neighbourhood': ', '.join}).reset_index()
Toronto_grouped = Toronto_grouped[['Postcode', 'Borough', 'Neighbourhood']]
Toronto_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
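
The grouping step can be checked on a tiny example. The sketch below uses a made-up three-row frame (`toy`) and assumes each postcode maps to exactly one borough, so it takes the first borough per group with `'first'` rather than `np.unique`; that substitution is my own choice for the illustration.

```python
import pandas as pd

# Toy data (hypothetical): two neighbourhoods share postcode M5A.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per postcode: keep the first borough, join the neighbourhood names.
toy_grouped = toy.groupby('Postcode', as_index=False).agg(
    {'Borough': 'first', 'Neighbourhood': ', '.join}
)

# Look up the merged row for M5A.
m5a = toy_grouped.loc[toy_grouped['Postcode'] == 'M5A', 'Neighbourhood'].iloc[0]
print(toy_grouped)
```

Here `m5a` holds the two names joined with a comma and a space, and the frame has one row per distinct postcode.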


Some postcodes in the dataframe have a borough but no assigned neighbourhood. For these, I used the borough name as the neighbourhood, using pandas' 'where' method.

In [5]:
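# Keep each Neighbourhood value unless it is 'Not assigned'; in that case use the Borough value.
# Apply where() to the Neighbourhood column only, so Postcode and Borough stay untouched.
Toronto_neighbourhood_cleaned = Toronto_grouped.copy()
Toronto_neighbourhood_cleaned['Neighbourhood'] = Toronto_grouped['Neighbourhood'].where(Toronto_grouped['Neighbourhood'] != 'Not assigned', Toronto_grouped['Borough'])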
print('{} neighbourhood(s) not assigned'.format(sum(Toronto_neighbourhood_cleaned.Neighbourhood == 'Not assigned')))

0 neighbourhood(s) not assigned
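
As a sanity check on the replacement logic, here is a small sketch on made-up data. It applies `Series.where` to the Neighbourhood column alone, which keeps values where the condition is true and takes the corresponding Borough value where it is false. The frame `toy` and its contents are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical): the first row has no assigned neighbourhood.
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

toy_cleaned = toy.copy()
# where(cond, other): keep the value where cond is True, else take `other`.
toy_cleaned['Neighbourhood'] = toy['Neighbourhood'].where(
    toy['Neighbourhood'] != 'Not assigned', toy['Borough']
)

remaining = int((toy_cleaned['Neighbourhood'] == 'Not assigned').sum())
print(toy_cleaned)
```

Working on a single column avoids touching Postcode and Borough, and `remaining` gives a direct count of any rows that were missed.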


In [6]:
Toronto_neighbourhood_cleaned.shape

(103, 3)

Finally, I saved this pandas DataFrame to a CSV file, which you can find in the same folder as this notebook.

In [8]:
Toronto_neighbourhood_cleaned.to_csv('Toronto_Neighbourhood')
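
One detail worth knowing when saving: by default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates this on a made-up frame, writing to in-memory buffers instead of files; passing `index=False` is an optional variation, not what the cell above does.

```python
import io
import pandas as pd

# Toy data (hypothetical), written to in-memory buffers rather than disk.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```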