### Scrape the Wikipedia page of Toronto neighbourhoods

In [44]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Create a BeautifulSoup object from the Wikipedia page.

In [45]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text  # download the raw HTML of the page
soup = BeautifulSoup(source, 'lxml')  # parse it with the lxml parser
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":867606113,"wgRevisionId":867606113,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","w

In the returned HTML, the information we need sits in the <tr> tags: header cells are in <th> and data cells are in <td>. Extract the text of each cell and store every row as a list.

In [46]:
templist = []
for line in soup.find_all('tr'):
    if len(line.find_all('th')) != 0:
        templist.append([item.text.strip('\n') for item in line.find_all('th')])
    else:
        templist.append([item.text.strip('\n') for item in line.find_all('td')])
templist

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 [
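
The loop above collects every <tr> on the page, not only the rows of the postal code table. A minimal sketch of scoping the search to a single table follows. It assumes the table of interest carries a `wikitable` class (an assumption about the page markup, not something the output above confirms) and uses a small inline HTML snippet, `sample_html`, as a hypothetical stand-in for the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: one data table plus an
# unrelated table whose rows we do not want to collect.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
<table class="navbox">
  <tr><td>Unrelated footer row</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Assumption: the table of interest is the first one with class "wikitable".
table = sample_soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # Header rows use <th>, data rows use <td>; collect whichever is present.
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])
```

With the search limited to one table, rows from other tables never enter the list, so no trailing rows need to be sliced off afterwards.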

Convert templist into a pandas dataframe, using the first row as the column names and skipping the last five rows.

In [47]:
df = pd.DataFrame([item[0:3] for item in templist[1:-5]])
df.columns = templist[0]
print(df.shape)
df.head()

(289, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
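
The header row can also be passed to the constructor directly instead of being assigned afterwards. A small sketch on toy data, where `sample_rows` is a hypothetical stand-in for templist:

```python
import pandas as pd

# Toy rows in the same shape as templist: header first, then data rows.
sample_rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Split the header from the body and name the columns in one step.
header, *body = sample_rows
sample_df = pd.DataFrame(body, columns=header)
```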


Process the dataframe. First, remove all records whose Borough is 'Not assigned'.

In [48]:
df = df[df.Borough != 'Not assigned'].copy()  # copy so later assignments do not act on a view
print(df.shape)
df.head()

(212, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


Second, for records that have a Borough but no Neighbourhood, use the Borough as the Neighbourhood.

In [49]:
no_neighbourhood = df['Neighbourhood'] == 'Not assigned'
df.loc[no_neighbourhood, 'Neighbourhood'] = df.loc[no_neighbourhood, 'Borough']
df = df.reset_index(drop=True)
print(df.shape)
df.head()

(212, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
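
The two cleaning steps so far can be bundled into one function. The sketch below runs on toy data; `clean_boroughs` and its `missing` parameter are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

def clean_boroughs(frame, missing="Not assigned"):
    """Hypothetical helper: drop rows without a borough, then fill gaps."""
    # Keep only rows with a real borough; copy so edits do not touch a view.
    out = frame[frame["Borough"] != missing].copy()
    # Where the neighbourhood is missing, fall back to the borough name.
    out["Neighbourhood"] = out["Neighbourhood"].mask(
        out["Neighbourhood"] == missing, out["Borough"]
    )
    return out.reset_index(drop=True)

cleaned = clean_boroughs(sample)
```

Here `Series.mask` replaces values where the condition holds and leaves the rest untouched, which matches the fill-in rule described above.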


Third, combine records that share the same Postcode and Borough into a single row, with their neighbourhoods joined into one comma-delimited string.

In [50]:
# Join the neighbourhoods of each (Postcode, Borough) pair into one comma-delimited string.
df4 = (df.groupby(['Postcode', 'Borough'])['Neighbourhood']
         .apply(','.join)
         .reset_index())
# Show full cell contents; None (pandas >= 1.0) replaces the deprecated -1 setting.
pd.set_option('display.max_colwidth', None)


To check that the rows were combined correctly, look at one postcode that has several neighbourhoods, M9V.

In [51]:
print(df4[df4.Postcode == 'M9V']['Neighbourhood'])
df[df.Postcode == 'M9V']

101    Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown
Name: Neighbourhood, dtype: object


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 174 | M9V | Etobicoke | Albion Gardens |
| 175 | M9V | Etobicoke | Beaumond Heights |
| 176 | M9V | Etobicoke | Humbergate |
| 177 | M9V | Etobicoke | Jamestown |
| 178 | M9V | Etobicoke | Mount Olive |
| 179 | M9V | Etobicoke | Silverstone |
| 180 | M9V | Etobicoke | South Steeles |
| 181 | M9V | Etobicoke | Thistletown |
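
Checking a single postcode by eye works for a spot check. The same comparison can be sketched for every postcode at once, here on toy data with hypothetical names (`long_df`, `wide_df`). It assumes that no neighbourhood name itself contains a comma, since the check counts names by splitting on commas.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "Parkwoods"],
})

# One row per (Postcode, Borough), neighbourhoods joined with commas.
wide_df = (
    long_df.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)

# Rows per postcode before grouping vs. names per postcode after grouping.
expected_counts = long_df.groupby("Postcode").size()
actual_counts = (
    wide_df.set_index("Postcode")["Neighbourhood"].str.split(",").str.len()
)

# Postcodes whose counts disagree; empty when every row was combined.
mismatches = expected_counts[expected_counts != actual_counts]
```

An empty `mismatches` series means each grouped row contains exactly as many names as there were original rows for that postcode.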


Finally, check the size of the resulting dataframe.

In [52]:
print(df4.shape)
df4.head()

(103, 3)


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
