# Segmenting and Clustering Neighborhoods in Toronto

## 1. Webscraping

In this section we scrape the table of postal codes from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and load it into a pandas dataframe.

In [1]:
!pip install beautifulsoup4



In [2]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import pandas as pd
import re
import numpy as np

# Ignore SSL certificate errors (this disables certificate verification for the request)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Fetch the page at the given url and parse its html
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

In [3]:
# take the first table on the page and collect its row (<tr>) tags
table = soup.find('table')
table_rows = table.find_all('tr')

In [4]:
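# extract the text of each row's <td> cells into a list of rows
data = []
for tr in table_rows:
    cells = tr.find_all('td')
    # strip whitespace and drop empty cells
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    # the header row has no <td> cells, so it yields an empty list and is skipped
    if row:
        data.append(row)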

df = pd.DataFrame(data, columns=["PostalCode", "Borough", "Neighborhood"])
df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Downtown Toronto | Queen's Park |
| 8 | M8A | Not assigned | Not assigned |
| 9 | M9A | Queen's Park | Not assigned |
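The row extraction above can be checked offline on a small inline table. The following is a minimal sketch, assuming a hypothetical two-row HTML snippet (`sample_html`) in place of the live Wikipedia page; it shows why the header row, which contains only `<th>` cells, never reaches the dataframe.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
  <tr><td>M4A</td><td>North York</td><td> Victoria Village </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find("table").find_all("tr"):
    # only <td> cells are collected, so a row made of <th> cells gives []
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:
        sample_rows.append(cells)

print(sample_rows)
```

Only the two data rows are kept, with surrounding whitespace stripped from each cell.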


## 2. Clean the data

In [5]:
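# filter out rows that have no assigned borough
# (.copy() gives an independent dataframe, so later column assignments do not raise SettingWithCopyWarning)
df = df[df["Borough"] != "Not assigned"].copy()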

In [6]:
# copy over borough value where there is no assigned neighborhood
df["Neighborhood"] = df["Borough"].where((df["Neighborhood"] == "Not assigned"), df["Neighborhood"])
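The direction of `Series.where` is easy to misread: it keeps the calling series where the condition is true and takes the second argument elsewhere. A minimal sketch on a hypothetical two-row frame (`toy_where`, not the scraped data) illustrates the step above.

```python
import pandas as pd

# Hypothetical toy data: one row without a neighborhood, one with
toy_where = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# True where the neighborhood is missing
mask = toy_where["Neighborhood"] == "Not assigned"

# keep Borough where mask is True, otherwise fall back to the existing Neighborhood
toy_where["Neighborhood"] = toy_where["Borough"].where(mask, toy_where["Neighborhood"])

print(toy_where)
```

The first row's neighborhood becomes "Queen's Park", while the second row keeps "Parkwoods".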

In [7]:
# combine the neighborhoods of rows that share a postal code into one comma-separated string
df = df.groupby("PostalCode").agg({"Borough": "first", "Neighborhood": ", ".join}).reset_index()
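To see what the aggregation does, here is a small sketch on hypothetical rows (`toy_groups`) modeled on the M6A entries shown earlier. Each postal code collapses to one row: the borough is taken from the first row of the group and the neighborhoods are joined with `", "`.

```python
import pandas as pd

# Hypothetical toy data: M6A appears twice, M7A once
toy_groups = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

merged = (
    toy_groups.groupby("PostalCode")
    .agg({"Borough": "first", "Neighborhood": ", ".join})
    .reset_index()  # turn the PostalCode index back into a regular column
)

print(merged)
```

Taking `"first"` for the borough assumes every row of a postal code carries the same borough, which holds for the rows shown in this section.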

In [8]:
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [9]:
df.shape

(103, 3)