# Segmenting and Clustering Neighborhoods in Toronto

## Question 1
### Scraping the table of Toronto postal codes from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M with BeautifulSoup and loading it into a pandas DataFrame

In [1]:
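from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(req.content, 'lxml')

# Take the first <table> on the page
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames; the markup is wrapped in StringIO
# because recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))

df1 = df[0]
df1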

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### Rename column headers

In [2]:
df1 = df1.rename(index=str, columns={'Postal Code':'PostalCode','Neighbourhood':'Neighborhood'})
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Drop the rows whose Borough is Not assigned

In [3]:
df1 = df1[df1['Borough']!='Not assigned']

### If several neighborhoods share a postal code, combine them into one comma-separated entry in the Neighborhood column

In [4]:
# Group neighborhoods that share a postal code into a single row
df1 = df1.groupby(['PostalCode', 'Borough'], as_index=False).agg(lambda x: ", ".join(x))
# Replace any '/' separators with commas
df1['Neighborhood'] = df1['Neighborhood'].str.replace('/', ',')
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
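
The grouping step is easiest to see on input that has one neighborhood per row. The following is a minimal sketch on hypothetical toy rows (the `toy` frame is an assumption for illustration, not the scraped table); it joins the neighborhoods of each postal code with a comma.

```python
import pandas as pd

# Hypothetical toy rows: one neighborhood per row, two rows share a postal code
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Woburn"],
})

# Collapse rows with the same (PostalCode, Borough) and join their neighborhoods
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
       .agg(lambda names: ", ".join(names))
)
print(grouped)
```

Selecting the `Neighborhood` column before aggregating keeps the join limited to that column, so other string columns would not be concatenated by accident.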


### If a Neighborhood is Not assigned, use its Borough as the Neighborhood

In [5]:
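# Use .loc with a boolean mask so the assignment modifies df1 itself;
# chained indexing (df1[mask]['Neighborhood'] = ...) would write to a temporary copy
mask = df1['Neighborhood'] == 'Not assigned'
df1.loc[mask, 'Neighborhood'] = df1.loc[mask, 'Borough']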
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
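
Since `head()` only shows rows whose neighborhood was already assigned, the output above cannot confirm that the replacement did anything. A small sketch on hypothetical toy data (the `toy` frame and its values are assumptions for illustration) shows the mask-based assignment and a check that no `Not assigned` value remains.

```python
import pandas as pd

# Hypothetical toy rows: the first one has a borough but no neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M1G"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# Boolean mask of rows that need a neighborhood
mask = toy["Neighborhood"] == "Not assigned"

# Assign through .loc so the original frame is modified in place
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

# Count how many unassigned neighborhoods are left
remaining = int((toy["Neighborhood"] == "Not assigned").sum())
print(toy)
print(remaining)
```

Running the same count on `df1` is a quick way to verify the step on the real data.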


### Check the shape of the DataFrame

In [6]:
df1.shape

(103, 3)

## Question 2
### Adding Latitude and Longitude to the DataFrame

In [13]:
# Load the coordinates from https://cocl.us/Geospatial_data
df2 = pd.read_csv('http://cocl.us/Geospatial_data')
df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
# Rename the header to match the previous DataFrame
df2.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
df2.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
# Check whether the number of rows matches the previous DataFrame
# If they are equal, we can merge the two DataFrames
df2.shape

(103, 3)
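
Equal row counts do not guarantee that both frames contain the same postal codes. As a hedged sketch, comparing the key sets directly catches mismatches that a shape check would miss; the frames `left` and `right` below are hypothetical stand-ins for `df1` and `df2`.

```python
import pandas as pd

# Hypothetical stand-ins: same number of rows, but one postal code differs
left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"]})
right = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1G"]})

# The shape check passes even though the keys differ
same_rows = left.shape[0] == right.shape[0]

# Set differences reveal the codes missing on either side
only_left = set(left["PostalCode"]) - set(right["PostalCode"])
only_right = set(right["PostalCode"]) - set(left["PostalCode"])

print(same_rows, only_left, only_right)
```

If both differences are empty, every postal code has a counterpart and the merge will not leave missing coordinates.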

### Merge the DataFrame with the coordinates

In [18]:
df1 = df1.merge(df2, on="PostalCode", how="left")
df1.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
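
Re-running a cell of the form `df1 = df1.merge(df2, ...)` merges the coordinates into an already merged frame, and pandas then adds suffixed duplicate columns such as `Latitude_x` and `Latitude_y`. A minimal sketch of a more defensive merge follows; the frames `areas` and `coords` and the name `merged` are hypothetical, while `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Hypothetical stand-ins for the neighborhood table and the coordinate table
areas = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Write the result to a new name so the cell can be re-run safely.
# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = areas.merge(
    coords, on="PostalCode", how="left",
    validate="one_to_one", indicator=True,
)

# Rows without coordinates would be marked "left_only"
unmatched = merged[merged["_merge"] != "both"]

# Drop the helper column once the check is done
merged = merged.drop(columns="_merge")
print(merged)
```

Keeping the inputs untouched and assigning to a separate name means repeated execution always starts from the same two frames.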
