# Toronto Neighbourhood
## Web scraping

First, we import the modules we need: BeautifulSoup, requests, pandas and re.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

Here we fetch the page with requests and parse it with BeautifulSoup to obtain the HTML table containing each postal code, borough and neighbourhood.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M"
data = requests.get(url).text
soup = BeautifulSoup(data, "html5lib")
# soup.find returns the first <table> element on the page
table = soup.find('table')
table_rows = table.find_all('tr')
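
To make the parsing step easier to follow offline, here is a minimal sketch that runs the same kind of lookup on an inline HTML snippet. The markup is a hypothetical stand-in for two table cells (it is not copied from the live page), and it uses the built-in `html.parser` rather than `html5lib` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one cell whose span wraps its text in <i>, one that does not.
html = """
<table>
  <tr>
    <td><p><b>M1A</b><span><i>Not assigned</i></span></p></td>
    <td><p><b>M3A</b><span>North York(Parkwoods)</span></p></td>
  </tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")  # built-in parser
cells = []
for span in soup_demo.find("table").find_all("span"):
    # Keep only spans without an <i> child, as the notebook's loop does
    if span.i is None:
        # The postal code sits in the <b> tag of the span's parent element
        cells.append((span.parent.b.text, span.get_text()))

print(cells)
```

Each kept cell yields the postal code from the `<b>` tag plus the raw span text, which still needs to be split into borough and neighbourhood.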

Next, we extract the data from each cell; some of it needs to be cleaned with regular expressions.

In [3]:
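zipcode = []
borough = []
neighbourhood = []
# Raw strings keep the backslashes in the regex patterns intact
pattern1 = r'^[^\(]+'      # borough: everything before the first "("
pattern2 = r'\(([^)]+)\)'  # neighbourhood: text inside the parentheses
for row in table_rows:
    inside = row.find_all('span')
    for temp in inside:
        # Only process spans that do not contain an <i> element
        if temp.i is None:
            z = temp.parent.b.text
            zipcode.append(z)
            text = temp.get_text()
            b = re.findall(pattern1, text)[0]
            n = re.findall(pattern2, text)[0]
            # Neighbourhoods are separated by " /"; use commas instead
            n = n.replace(' /', ',')
            borough.append(b)
            neighbourhood.append(n)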
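
The two patterns can be tried in isolation on a sample string. The sketch below assumes the cell text has the form `Borough(Name / Name)`; the helper `split_cell` is a hypothetical name introduced here for illustration, and the extra `strip()` is an addition that the loop above does not perform.

```python
import re

BOROUGH_RE = re.compile(r'^[^(]+')             # everything before the first "("
NEIGHBOURHOOD_RE = re.compile(r'\(([^)]+)\)')  # text inside the first "(...)"

def split_cell(text):
    """Return (borough, neighbourhood) parsed from 'Borough(Name / Name)'."""
    borough = BOROUGH_RE.findall(text)[0].strip()
    # Replace the " /" separator with a comma, as in the notebook
    neighbourhood = NEIGHBOURHOOD_RE.findall(text)[0].replace(' /', ',')
    return borough, neighbourhood

print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
# ('Downtown Toronto', 'Regent Park, Harbourfront')
```

Because the borough pattern takes everything before the first parenthesis, any extra text that precedes the parenthesis ends up in the borough.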

In [7]:
data = {'PostalCode': zipcode, 'Borough': borough, 'Neighbourhood':neighbourhood}
df = pd.DataFrame(data)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Some entries were parsed incorrectly and need to be fixed individually, for example row 100, where the borough and neighbourhood text ran together.

In [8]:
df.at[35,'Borough'] = 'East York'
df.at[76,'Borough'] = 'Mississauga'
df.at[76,'Neighbourhood'] = 'Canada Post Gateway Processing Centre'
df.at[92,'Borough'] = 'Downtown Toronto'
df.at[92,'Neighbourhood'] = 'The Esplanade'
df.at[94,'Borough'] = 'Etobicoke'
df.at[100,'Borough'] = 'East Toronto'
df.at[100,'Neighbourhood'] = 'Business reply mail Processing Centre'
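
Positional indices such as 100 depend on the row order of the scraped table. As an alternative sketch (not the approach used above), the same kind of correction can be keyed on the postal code. The toy frame `df_demo` and the `fixes` mapping are hypothetical names; the misparsed borough string is the truncated value exactly as displayed in the output above.

```python
import pandas as pd

# Toy frame standing in for the scraped table
df_demo = pd.DataFrame({
    'PostalCode': ['M3A', 'M7Y'],
    # Second value is the truncated string as displayed in the output above
    'Borough': ['North York', 'East TorontoBusiness reply mail Processing Cen...'],
    'Neighbourhood': ['Parkwoods', 'Enclave of M4L'],
})

# Hypothetical mapping: postal code -> corrected column values
fixes = {
    'M7Y': {'Borough': 'East Toronto',
            'Neighbourhood': 'Business reply mail Processing Centre'},
}

for code, values in fixes.items():
    mask = df_demo['PostalCode'] == code  # rows carrying this postal code
    for column, value in values.items():
        df_demo.loc[mask, column] = value

print(df_demo)
```

Keying on `PostalCode` keeps the corrections valid even if rows are added or reordered on the source page.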

In [9]:
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


The shape of the final table is shown below.

In [10]:
df.shape

(103, 3)

----
# Toronto Neighbourhood
## Geolocating

The geocoder module turned out to be buggy and unreliable, so we rely on the provided geospatial dataset instead. <br>
First, we import the CSV file and rename the postal code column for consistency with the scraped table.

In [16]:
df2 = pd.read_csv("Geospatial_Coordinates.csv", float_precision='round_trip')
df2.rename(columns = {'Postal Code':'PostalCode'},inplace = True)
df2

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437
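
The load-and-rename step can be reproduced without the file on disk. This sketch assumes the CSV header is `Postal Code,Latitude,Longitude` (inferred from the rename above) and uses the first two rows as displayed; `csv_text` and `coords` are names introduced for illustration.

```python
import io
import pandas as pd

# Inline stand-in for Geospatial_Coordinates.csv, using values as displayed above
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

coords = (
    pd.read_csv(io.StringIO(csv_text))
      .rename(columns={'Postal Code': 'PostalCode'})  # returns a new frame
)

print(coords.columns.tolist())
# ['PostalCode', 'Latitude', 'Longitude']
```

Chaining `rename` instead of passing `inplace=True` is a style choice; both leave the frame with a `PostalCode` column that matches the scraped table.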


We then merge the two tables on their shared PostalCode column.

In [17]:
result = pd.merge(df, df2, on='PostalCode')
result

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing Centre,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
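
By default `pd.merge` performs an inner join, so any postal code missing from either table is dropped without notice. The following sketch shows one way to surface such rows, using a left join with `indicator=True`. The toy frames and the code `M0X` are hypothetical and chosen only to produce an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],  # 'M0X' is made up
                     'Borough': ['North York', 'North York', 'Unknown']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every scraped row; validate raises if a key is duplicated
merged = pd.merge(left, right, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Rows present only in the left table have no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
# ['M0X']
```

If `unmatched` is non-empty, those postal codes have no entry in the coordinates table and would be absent from an inner-join result.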
