# **Segmenting and Clustering Neighborhoods in Toronto**
 
  
### **Part I** 
by Yong Chang *(ychang2000@gmail.com)*

First, let's import the table from the Wikipedia page and run a few quick checks to verify that the data was acquired correctly.

In [4]:
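import pandas as pd

# read_html returns a list of DataFrames, one per HTML table found on the page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(url)
table = tables[0]  # the postal code table is the first table on the page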

In [5]:
table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [6]:
table.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned


In [31]:
table.shape

(288, 3)

In [35]:
table.describe(include='all')

Unnamed: 0,Postcode,Borough,Neighbourhood
count,288,288,288
unique,180,12,209
top,M9V,Not assigned,Not assigned
freq,8,77,78




Next, let's work on the data wrangling. As the assignment requires, we will ***remove*** the rows whose ***borough*** is ***'Not assigned'***.

In [41]:
table_temp1=table.loc[table['Borough']!='Not assigned'].reset_index(drop=True)
table_temp1

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
...,...,...,...
206,M8Z,Etobicoke,Kingsway Park South West
207,M8Z,Etobicoke,Mimico NW
208,M8Z,Etobicoke,The Queensway West
209,M8Z,Etobicoke,Royal York South West
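
To make the filtering step easier to follow, here is a minimal sketch on a toy dataframe. The four rows are copied from the output above purely for illustration, and the names `toy`, `mask`, and `kept` are hypothetical. The comparison builds a boolean mask, `.loc` keeps the rows where the mask is `True`, and `reset_index(drop=True)` renumbers the remaining rows from 0.

```python
import pandas as pd

# Toy rows taken from the table above (illustration only)
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Not assigned"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront", "Not assigned"],
})

# Boolean mask: True where a borough has been assigned
mask = toy["Borough"] != "Not assigned"

# Keep only the rows where the mask is True, then renumber the index from 0
kept = toy.loc[mask].reset_index(drop=True)
print(kept)
```

Without `reset_index(drop=True)`, the kept rows would retain their original labels (1 and 2 here), which makes positional checks such as `head()` and `tail()` harder to read.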


Then, for neighborhoods that share a postal code, combine them into a single row, with the names separated by commas.

In [71]:
table_temp2=table_temp1.groupby('Postcode',as_index=False).agg(lambda x:','.join(set(x)))
table_temp2.tail(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
83,M6R,West Toronto,"Parkdale,Roncesvalles"
84,M6S,West Toronto,"Swansea,Runnymede"
85,M7A,Queen's Park,Not assigned
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
88,M8V,Etobicoke,"Mimico South,New Toronto,Humber Bay Shores"
89,M8W,Etobicoke,"Alderwood,Long Branch"
90,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
91,M8Y,Etobicoke,"Sunnylea,Kingsway Park South East,Mimico NE,Th..."
92,M8Z,Etobicoke,"Kingsway Park South West,The Queensway West,Mi..."
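
One caveat about the cell above: a Python `set` does not preserve insertion order, so the joined neighborhood names can come out in a different order than they appear in the source table. The following is a sketch of an order-preserving alternative, not the method used in this notebook. The helper `join_unique` and the toy rows are hypothetical. It relies on `dict.fromkeys`, which drops duplicates while keeping first-seen order.

```python
import pandas as pd

# Toy rows taken from the tables above (illustration only)
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M8Z", "M8Z"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Etobicoke", "Etobicoke"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Mimico NW", "The Queensway West"],
})

def join_unique(values):
    # dict.fromkeys removes duplicates and keeps the first-seen order
    return ",".join(dict.fromkeys(values))

# Take the first borough per postal code and join the neighbourhood names in order
combined = toy.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": join_unique}
)
print(combined)
```

Using `"first"` for the borough assumes each postal code maps to exactly one borough, which appears to hold for this table but is worth verifying on the full data.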


Now, as the assignment specifies, if a row has a borough but its neighborhood is 'Not assigned', the neighborhood is set to the name of the borough.

In [72]:
table_temp2.loc[table_temp2['Neighbourhood']=='Not assigned','Neighbourhood']=table_temp2['Borough']

In [73]:
table_temp2.tail(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
83,M6R,West Toronto,"Parkdale,Roncesvalles"
84,M6S,West Toronto,"Swansea,Runnymede"
85,M7A,Queen's Park,Queen's Park
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
88,M8V,Etobicoke,"Mimico South,New Toronto,Humber Bay Shores"
89,M8W,Etobicoke,"Alderwood,Long Branch"
90,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
91,M8Y,Etobicoke,"Sunnylea,Kingsway Park South East,Mimico NE,Th..."
92,M8Z,Etobicoke,"Kingsway Park South West,The Queensway West,Mi..."
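
The `.loc` assignment above works because pandas aligns the right-hand side with the selected rows by index label, so each 'Not assigned' row receives the borough from its own row. As a hedged illustration, the same replacement can be written with `Series.mask`. The toy rows and the names `toy` and `missing` are hypothetical.

```python
import pandas as pd

# Toy rows taken from the output above (illustration only)
toy = pd.DataFrame({
    "Postcode": ["M6S", "M7A"],
    "Borough": ["West Toronto", "Queen's Park"],
    "Neighbourhood": ["Swansea,Runnymede", "Not assigned"],
})

# True for rows whose neighbourhood still needs a value
missing = toy["Neighbourhood"] == "Not assigned"

# Where `missing` is True, take the value from the Borough column of the same row
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```

Both forms leave rows with a real neighborhood name untouched.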


Finally, let's rename the columns to match the names used in the assignment, then check the shape of the dataframe as required.

In [76]:
# rename returns a new dataframe, so table_temp2 keeps its original column names
table_cy=table_temp2.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'})
table_cy.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,Guildwood,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [77]:
table_cy.shape

(103, 3)

The final dataframe has 103 rows and 3 columns.
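
Beyond the shape, a few invariants of the cleaned table can be checked programmatically. The sketch below is an optional addition, with a hypothetical helper `check_clean` and toy rows taken from the output above. It asserts that the columns have the expected names, that each postal code appears once, and that no 'Not assigned' values remain.

```python
import pandas as pd

def check_clean(df):
    """Raise AssertionError if the cleaned table breaks an expected invariant."""
    assert list(df.columns) == ["PostalCode", "Borough", "Neighborhood"]
    assert df["PostalCode"].is_unique  # one row per postal code
    assert not (df["Borough"] == "Not assigned").any()
    assert not (df["Neighborhood"] == "Not assigned").any()
    return True

# Toy rows taken from the head() output above (illustration only)
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1G", "M1H"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern,Rouge", "Woburn", "Cedarbrae"],
})

ok = check_clean(toy)
print(ok)
```

In the notebook, the same helper could be called as `check_clean(table_cy)` right after the shape check.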

### ***Thank You!***