# Segmenting and Clustering Neighborhoods in Toronto

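Install the libraries required for scraping: BeautifulSoup, plus the `lxml` parser that the parsing cell below relies on.

In [1]:
!pip install BeautifulSoup4
!pip install lxml
!pip install html5lib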



Fetch the page from the provided Wikipedia link, parse it, and store the postal code table in an object named `table`.

In [3]:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table', attrs={'class': 'wikitable sortable'})

Convert the HTML table into a pandas DataFrame.

In [4]:
import pandas as pd

data = []
column_names = ['Postal Code', 'Borough', 'Neighborhood'] 
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
    
df = pd.DataFrame(data, columns = column_names)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
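The empty first row comes from the table's header: its cells are `<th>` elements, so `row.find_all('td')` finds nothing in it. A minimal sketch of this effect, assuming a hypothetical inline HTML fragment (not the live Wikipedia page) and Python's built-in `html.parser` instead of `lxml`:

```python
import bs4 as bs

# Hypothetical fragment mimicking the layout of the Wikipedia table.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'wikitable sortable'})

parsed = []
for row in table.find_all('tr'):
    # Only <td> cells are collected, so the <th>-only header yields [].
    cells = [cell.text.strip() for cell in row.find_all('td')]
    parsed.append(cells)

print(parsed[0])  # the header row
print(parsed[1])  # the first data row
```

An alternative would be to skip rows whose cell list is empty while parsing, which would make a separate drop step unnecessary.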


Drop the first row, which is empty, and all rows whose Borough is 'Not assigned'.

In [5]:
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)
df = df.drop(df.index[0])
df = df.reset_index(drop=True)
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
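The same cleanup can also be written as a filter rather than an in-place drop. The sketch below is an illustration on a hypothetical four-row frame, not the notebook's data; `toy` and `kept` are made-up names:

```python
import pandas as pd

columns = ['Postal Code', 'Borough', 'Neighborhood']
toy = pd.DataFrame(
    [
        [None, None, None],                       # empty header-derived row
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
        ['M4A', 'North York', 'Victoria Village'],
    ],
    columns=columns,
)

# Remove rows in which every column is missing, then keep assigned boroughs.
kept = toy.dropna(how='all')
kept = kept[kept['Borough'] != 'Not assigned'].reset_index(drop=True)
print(kept)
```

Filtering with a boolean mask returns a new frame, so the original stays available for comparison.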


If a cell has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough's name.

In [6]:
# Copy the borough name into every row whose neighborhood is "Not assigned".
# A boolean mask handles any number of such rows, each with its own borough.
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
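To illustrate the rule on a case with more than one unassigned neighborhood, here is a small sketch using `Series.mask`. The frame is hypothetical, and the `M1X` row is invented purely for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M9A', 'M1X'],  # 'M1X' is a made-up code
    'Borough': ["Queen's Park", 'Etobicoke', 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Islington Avenue', 'Not assigned'],
})

# Where the condition holds, take the value from Borough in the same row.
is_missing = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(is_missing, toy['Borough'])
print(toy)
```

Each unassigned row receives its own borough, which matters as soon as two such rows belong to different boroughs.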


Group the rows by postal code so that each code appears once, with its neighborhoods joined into a comma-separated list.

In [7]:
grouped_df = df.groupby(['Postal Code', 'Borough'], sort=False).agg(','.join).reset_index()
grouped_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
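The grouping step can be tried on a few rows in isolation. The sketch below assumes a hypothetical five-row frame taken from the values shown above, and selects the `Neighborhood` column explicitly before joining:

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M6A', 'M6A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# sort=False keeps groups in order of first appearance;
# ','.join concatenates the neighborhood strings within each group.
grouped = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
       .agg(','.join)
       .reset_index()
)
print(grouped)
```

Naming the column being aggregated makes the intent explicit and avoids surprises if other string columns are added later.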


Sort the DataFrame by postal code, which makes it easier to add further columns later.

In [13]:
# Sort by postal code and renumber the rows from 0 without keeping the old index.
sorted_df = grouped_df.sort_values(by='Postal Code').reset_index(drop=True)
sorted_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
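Sorting and renumbering can also be done in one call. This is a sketch that assumes pandas 1.0 or newer, where `sort_values` accepts `ignore_index`; the three-row frame is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M1B'],
    'Borough': ['North York', 'Downtown Toronto', 'Scarborough'],
    'Neighborhood': ['Parkwoods', 'Harbourfront,Regent Park', 'Rouge,Malvern'],
})

# ignore_index=True relabels the sorted rows 0..n-1 in the same step.
sorted_toy = toy.sort_values(by='Postal Code', ignore_index=True)
print(sorted_toy)
```

Postal codes are compared as plain strings here, which gives the expected order because every code has the same length and format.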


In [14]:
sorted_df.shape

(103, 3)
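Beyond the shape, a few quick consistency checks can confirm that the cleaning steps had the intended effect. The helper below is a hypothetical sketch (`summarize` is not part of pandas or of this notebook), shown on a made-up two-row frame:

```python
import pandas as pd

def summarize(frame):
    """Return simple consistency facts about a cleaned postal-code frame."""
    return {
        'rows': frame.shape[0],
        'unique_postal_codes': bool(frame['Postal Code'].is_unique),
        'unassigned_boroughs': int((frame['Borough'] == 'Not assigned').sum()),
    }

toy = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill,Port Union'],
})

report = summarize(toy)
print(report)
```

Applied to `sorted_df`, a report with unique postal codes and zero unassigned boroughs would indicate that grouping and filtering worked as described.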