# Segmenting and Clustering Neighborhoods in Toronto

## Question 1

Scrape the table of Toronto postal codes from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M with the BeautifulSoup package and load it into a pandas DataFrame.

In [8]:
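from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the page and stop early if the request failed
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
req.raise_for_status()
soup = BeautifulSoup(req.content, 'lxml')

# Take the first <table> element on the page
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames; StringIO avoids the deprecation
# warning newer pandas versions raise for literal HTML strings
df = pd.read_html(StringIO(str(table)))
df1 = df[0]
df1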

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
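
The scraping step can also be tried without network access. The following is a minimal sketch, not the notebook's own method: it parses a small hand-written HTML snippet (`SAMPLE_HTML`, a made-up stand-in for the Wikipedia page) with BeautifulSoup's built-in `html.parser` and builds the DataFrame row by row instead of calling `pd.read_html`. The helper name `table_to_frame` is hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia page: a tiny table with the same columns.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def table_to_frame(html):
    """Convert the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # The first row holds the header cells, the remaining rows hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Building the rows by hand needs only `bs4` and `pandas`, which can be convenient when `lxml` is not installed.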


Rename the column headers so that they contain no spaces and use one spelling of "Neighborhood"

In [23]:
# Only the column labels need renaming, so the index is left unchanged
df1 = df1.rename(columns={'Postal Code':'PostalCode','Neighbourhood':'Neighborhood'})
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Drop the rows whose Borough is Not assigned

In [24]:
df1 = df1[df1['Borough']!='Not assigned']
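
The filter above relies on boolean masking. A minimal sketch on made-up toy data (the frame `toy` and the name `assigned` are illustrative, not part of the notebook) shows the effect, with an optional `reset_index` so that the remaining rows are renumbered from 0:

```python
import pandas as pd

# Made-up rows in the same layout as the scraped table.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
keep = toy["Borough"] != "Not assigned"
assigned = toy[keep].reset_index(drop=True)
print(assigned)
```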

If several neighborhoods share the same postal code, combine them into a single row, with the names separated by commas in the Neighborhood column

In [26]:
#group multiple neighborhoods having same postal code
df1 = df1.groupby(['PostalCode', 'Borough'], as_index=False).agg(lambda x: ", ".join(x))
df1['Neighborhood'] = df1['Neighborhood'].str.replace('/', ',')
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
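
To see what the grouping does, here is a minimal sketch on made-up toy data in which one postal code appears on two rows. The frame `toy` and the name `merged` are illustrative, and the aggregation is restricted to the Neighborhood column, which is a small variation on the cell above.

```python
import pandas as pd

# Made-up input: M5A is listed twice, once per neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One output row per (PostalCode, Borough) pair; the neighborhood names of
# each group are joined into a single comma-separated string.
merged = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(merged)
```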


If a Neighborhood is Not assigned, use the Borough name as the Neighborhood

In [27]:
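# Assign through .loc with a boolean mask; chained indexing such as
# df1[mask]['Neighborhood'] = ... would write to a temporary copy
mask = df1['Neighborhood'] == 'Not assigned'
df1.loc[mask, 'Neighborhood'] = df1.loc[mask, 'Borough']
df1.head()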

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
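
Conditional assignment of this kind is usually written with `.loc` and a boolean mask, because chained indexing (`df[mask]['col'] = ...`) operates on a temporary copy and leaves the original frame unchanged. A minimal sketch on made-up toy data (the frame `toy` is illustrative only):

```python
import pandas as pd

# Made-up rows: the first one has no neighborhood name.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Select the affected rows once, then write the Borough value into them.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```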


Check the shape of the DataFrame

In [28]:
df1.shape

(103, 3)