### Scrape the Wikipedia page, parse the table of postal codes and neighborhoods, reformat the entries, and create a pandas DataFrame.

In [196]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [197]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [198]:
soup = BeautifulSoup(source,'lxml')

In [199]:
table = soup.table 

In [200]:
table_body = table.find('tbody')

In [201]:
rows = table_body.find_all('tr')

So far, we have used BeautifulSoup to parse the HTML source of the Wikipedia page, selected the 'tbody' element of the first table on the page, and collected its 'tr' elements, each of which defines one table row.
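
To make the structure of these calls concrete, here is a minimal sketch that runs the same steps on a tiny inline table instead of the live page. The HTML string and the `demo_` names are made up for illustration, and the built-in 'html.parser' is used in place of 'lxml' only so the snippet needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
html = """
<table>
  <tbody>
    <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

demo_soup = BeautifulSoup(html, 'html.parser')
demo_table = demo_soup.table                          # first <table> in the document
demo_rows = demo_table.find('tbody').find_all('tr')   # one Tag per table row

# Extract the stripped text of every <td> cell in each row.
# The header row holds <th> cells, so it yields an empty list.
cells_per_row = [[td.text.strip() for td in tr.find_all('td')] for tr in demo_rows]
print(cells_per_row)
```

Note that `soup.table` returns only the first table, so this approach relies on the postal code table being the first one on the page.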

In [202]:
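data = []
for row in rows:
    # Collect the <td> cells of this row and keep only their stripped text.
    col = row.find_all('td')
    col = [e.text.strip() for e in col]
    # Drop empty cell strings before storing the row.
    data.append([e for e in col if e])
# Remove rows that produced no cells at all.
data = [item for item in data if item]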

In [203]:
data = [item for item in data if item[1] != 'Not assigned']

The code above loops over each row of the table, finds its 'td' cells, strips each cell down to its text, and appends the result to a list, so that each entry of the list is one row of the table. A list comprehension then filters out empty lists (the first entry, for instance, was empty, most likely because the header row uses 'th' cells rather than 'td'). Finally, we drop every item whose second element (index 1) is the string 'Not assigned'; this removes table entries where the Borough is not assigned, as instructed.
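
The two filtering steps can be checked in isolation on a few hand-written rows. This is only a sketch: the `parsed` list below is hypothetical sample data standing in for what the loop produces, not output copied from the page.

```python
# Hypothetical parsed rows; the first is what a row without <td> cells produces.
parsed = [
    [],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Regent Park / Harbourfront'],
]

# Step 1: drop empty rows (an empty list is falsy).
non_empty = [item for item in parsed if item]

# Step 2: drop rows whose Borough (index 1) is 'Not assigned'.
assigned = [item for item in non_empty if item[1] != 'Not assigned']

print(assigned)
```

The order matters here: indexing `item[1]` on an empty list would raise an `IndexError`, so the empty rows have to be removed first.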

In [204]:
df = pd.DataFrame(data)

In [205]:
names = ['Postal code','Borough','Neighborhood']
df.columns = names

In [206]:
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


The next step is to replace every slash (/) separator with a comma (,). After this, we check whether any dataframe entries have an assigned Borough but an unassigned Neighborhood; if there were any, the Neighborhood would be set to the same value as the Borough. The check below returns no rows, so no reassignment is needed.

In [207]:
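# Replace ' /' with ',' using the vectorized string method (literal match, no regex).
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
# Show any rows whose Neighborhood is still unassigned.
df[df['Neighborhood'] == "Not assigned"]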

Unnamed: 0,Postal code,Borough,Neighborhood
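
Since the check above finds no such rows, the fallback is never exercised in this notebook. As a sketch of how it could be done if it were needed, the snippet below applies the separator replacement and then copies the Borough into the Neighborhood for unassigned rows. The `demo` dataframe and its second row are hypothetical sample data, not rows taken from the page.

```python
import pandas as pd

# Hypothetical sample: the second row has a Borough but no Neighborhood.
demo = pd.DataFrame({
    'Postal code': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Regent Park / Harbourfront', 'Not assigned'],
})

# Replace the slash separator with a comma (literal match, no regex).
demo['Neighborhood'] = demo['Neighborhood'].str.replace(' /', ',', regex=False)

# Where the Neighborhood is unassigned, fall back to the Borough.
mask = demo['Neighborhood'] == 'Not assigned'
demo.loc[mask, 'Neighborhood'] = demo.loc[mask, 'Borough']

print(demo)
```

Using a boolean mask with `.loc` updates only the affected rows and leaves the rest of the column untouched.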


In [208]:
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [209]:
print(f"Number of rows: {df.shape[0]}")

Number of rows: 103
