# Toronto Neighbourhood
## Web scraping

First, we import the required modules: BeautifulSoup, requests, pandas and re.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

Here we fetch the Wikipedia page with requests, parse it with BeautifulSoup, and locate the HTML table containing the postal codes, boroughs and neighbourhoods.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M"
data = requests.get(url).text
soup = BeautifulSoup(data, "html5lib")
table = soup.find('table')  # first table on the page
table_rows = table.find_all('tr')
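To make the parsing step easier to follow without a network request, here is a minimal offline sketch. The inline HTML is a hypothetical fragment, not a copy of the Wikipedia page: it assumes each cell holds the postal code in a `<b>` tag next to a `<span>`, and that unassigned cells wrap their text in `<i>`. It uses the built-in `html.parser` instead of `html5lib` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the assumed cell layout (not real page source).
html = """
<table>
  <tr>
    <td><p><b>M1A</b><br><span><i>Not assigned</i></span></p></td>
    <td><p><b>M3A</b><br><span>North York(Parkwoods)</span></p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find('table').find_all('tr')

codes = []
for row in rows:
    for span in row.find_all('span'):
        # Keep only spans without an <i> child, i.e. assigned postal codes.
        if span.i is None:
            codes.append((span.parent.b.text, span.get_text()))

print(codes)
```

Only the second cell is kept, because the first span contains an `<i>` element.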

Next, we extract the data from each cell. The borough and neighbourhood come out as a single text string, so we separate them with regular expressions.

In [3]:
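zipcode = []
borough = []
neighbourhood = []
# Raw strings avoid invalid-escape warnings in the regex patterns.
pattern1 = r'^[^(]+'       # text before the first '(' -> borough
pattern2 = r'\(([^)]+)\)'  # text inside the parentheses -> neighbourhood
for row in table_rows:
    inside = row.find_all('span')
    for temp in inside:
        # Skip cells whose span contains an <i> tag.
        if temp.i is None:
            z = temp.parent.b.text
            zipcode.append(z)
            text = temp.get_text()
            b = re.findall(pattern1, text)[0]
            n = re.findall(pattern2, text)[0]
            n = n.replace(' /', ',')
            borough.append(b)
            neighbourhood.append(n)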
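The two regular expressions can be tried on a single cell string to see what each one captures. The sketch below is illustrative only: `split_cell` is a hypothetical helper, and the sample strings are assumed to resemble the cell text. Unlike the loop above, it returns `None` instead of raising an `IndexError` when a cell has no parenthesised part.

```python
import re

BOROUGH_RE = re.compile(r'^[^(]+')              # everything before the first '('
NEIGHBOURHOOD_RE = re.compile(r'\(([^)]+)\)')   # contents of the first (...) group

def split_cell(text):
    """Split 'Borough(Hood A / Hood B)' into (borough, 'Hood A, Hood B')."""
    borough_match = BOROUGH_RE.search(text)
    hood_match = NEIGHBOURHOOD_RE.search(text)
    borough = borough_match.group(0).strip() if borough_match else None
    hood = hood_match.group(1).replace(' /', ',') if hood_match else None
    return borough, hood

print(split_cell("Downtown Toronto(Regent Park / Harbourfront)"))
# ('Downtown Toronto', 'Regent Park, Harbourfront')
```

Cells with more than one parenthesised group would need extra handling, which is one reason some rows still have to be corrected by hand.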

In [4]:
data = {'zipcode': zipcode, 'borough': borough, 'neighbourhood': neighbourhood}
df = pd.DataFrame(data)
df

,zipcode,borough,neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


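Some entries were not parsed cleanly (for example, row 100, where the borough and neighbourhood text ran together), so we fix them individually by row index.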

In [5]:
df.at[35,'borough'] = 'East York'
df.at[76,'borough'] = 'Mississauga'
df.at[76,'neighbourhood'] = 'Canada Post Gateway Processing Centre'
df.at[92,'borough'] = 'Downtown Toronto'
df.at[92,'neighbourhood'] = 'The Esplanade'
df.at[94,'borough'] = 'Etobicoke'
df.at[100,'borough'] = 'East Toronto'
df.at[100,'neighbourhood'] = 'Business reply mail Processing Centre'
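Fixing rows by integer position works, but it silently edits the wrong row if the table order ever changes. As a possible alternative, the sketch below keys the corrections on the postal code instead. The `toy` frame and the `corrections` dictionary are hypothetical stand-ins built from values shown in this notebook; the malformed borough string is copied as displayed, truncation included.

```python
import pandas as pd

# Small stand-in for df; values taken from the output shown earlier.
toy = pd.DataFrame({
    'zipcode': ['M3A', 'M7Y'],
    'borough': ['North York', 'East TorontoBusiness reply mail Processing Cen...'],
    'neighbourhood': ['Parkwoods', 'Enclave of M4L'],
})

# Corrections keyed by postal code rather than by row position.
corrections = {
    'M7Y': {'borough': 'East Toronto',
            'neighbourhood': 'Business reply mail Processing Centre'},
}

for code, fields in corrections.items():
    mask = toy['zipcode'] == code  # boolean mask selecting the matching row(s)
    for column, value in fields.items():
        toy.loc[mask, column] = value

print(toy)
```

If a postal code in `corrections` is missing from the table, the mask is all `False` and nothing is changed.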

In [6]:
df

,zipcode,borough,neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Finally, we check the shape of the table: 103 rows and 3 columns.

In [7]:
df.shape

(103, 3)