## Wikipedia scrape notebook - Toronto postal codes

In [2]:
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [3]:
wiki_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
# Query the website and return the HTML response as `page`
page = urlopen(wiki_page)
# Parse the response and store the parse tree in `soup`
soup = BeautifulSoup(page, 'html.parser')

Now that the Wikipedia page has been fetched and parsed into a BeautifulSoup object (`soup`), we can extract the postal code table and convert it into a DataFrame.

In [5]:
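# Extract the first table on the page and convert it into a DataFrame
from io import StringIO
table = soup.find_all('table')[0]
df = pd.read_html(StringIO(str(table)))[0]  # read_html already returns DataFrames
# Promote the first row to column headers, then drop it from the data
header = df.iloc[0]
df = df[1:]
df = df.rename(columns=header)
df.head(15)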

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned


Next, drop the rows whose Borough is 'Not assigned', replace any remaining 'Not assigned' Neighbourhood with the name of its Borough, and combine rows that share the same Postcode into a single row.
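The three steps can be sketched on a few toy rows before running them on the scraped table. This is only an illustration under stated assumptions: the `toy`, `kept` and `grouped` names and the sample rows are hypothetical, and the fill step uses `Series.mask` as a vectorized alternative to a row-wise `apply`.

```python
import pandas as pd

# Toy rows mirroring the scraped table's layout (hypothetical sample, not the full data).
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# 1. Drop rows whose borough is unassigned; .copy() gives an independent frame to modify.
kept = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where the neighbourhood is unassigned, fall back to the borough name.
kept["Neighbourhood"] = kept["Neighbourhood"].mask(
    kept["Neighbourhood"] == "Not assigned", kept["Borough"]
)

# 3. Join the neighbourhoods that share a postcode into one comma-separated row.
grouped = (
    kept.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)
print(grouped)
```

On this sample, M1A is dropped, M5A collapses to a single row listing both neighbourhoods, and M7A takes its borough name as its neighbourhood.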

In [6]:
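# Keep only rows with an assigned borough; .copy() avoids a SettingWithCopyWarning on later assignment
df = df[df.Borough != 'Not assigned'].copy()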
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


In [7]:
# Fall back to the borough name where the neighbourhood is 'Not assigned'
df['Neighbourhood'] = df.apply(lambda row: row['Borough'] if (row['Neighbourhood']=='Not assigned') else row['Neighbourhood'],axis=1)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


In [8]:
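# Combine neighbourhoods sharing a postcode into one comma-separated string, keeping the original row order
df_grp = df.groupby(['Postcode','Borough'], sort=False)['Neighbourhood'].apply(','.join).reset_index()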
df_grp.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


The resulting DataFrame has one row per postal code, giving its borough and a comma-separated list of its neighbourhoods.

In [9]:
df_grp.shape

(103, 3)
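Beyond checking the shape, it can be worth confirming that the grouped table really has the properties described above. The following is a minimal sketch, assuming the same three column names; `check_grouped` and the `sample` frame are hypothetical helpers introduced only for illustration.

```python
import pandas as pd

def check_grouped(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the grouped table breaks the expected properties."""
    # Each postcode should appear exactly once after grouping.
    assert frame["Postcode"].is_unique, "duplicate postcodes remain"
    # No placeholder values should survive the cleaning steps.
    assert not (frame["Borough"] == "Not assigned").any(), "unassigned borough remains"
    assert not (frame["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood remains"

# Hypothetical two-row sample in the same layout as the grouped table.
sample = pd.DataFrame(
    {
        "Postcode": ["M3A", "M5A"],
        "Borough": ["North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Harbourfront,Regent Park"],
    }
)
check_grouped(sample)  # passes silently when all checks hold
```

In the notebook itself, the same function could be called as `check_grouped(df_grp)`.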