# Segmenting and Clustering Neighborhoods in Toronto

This notebook is part of the IBM Data Science Certification capstone project (week 3). Its goal is to segment and cluster neighborhoods in Toronto.

To explore and cluster the neighborhoods in Toronto, we use a Wikipedia page that lists the postal codes together with their boroughs and neighborhoods.
We will complete the project in three steps:

Step 1 - we will scrape the Wikipedia page.

Step 2 - we will perform data wrangling and data cleaning on a pandas dataframe built from the scraped table, so that the data ends up in a structured format.

Step 3 - we will use K-means to cluster the data.

## Import required libraries

In [22]:
#get required libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from IPython.display import display_html

## Step 1 - Scrape the Wikipedia page of Canadian postal codes and create a pandas dataframe

In [23]:
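#scrape the Wikipedia page
page_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(page_url).text

#parse the response as HTML; the 'xml' parser is meant for XML documents and can mishandle HTML
soup = BeautifulSoup(source, 'lxml')

#keep the first <table> element on the page as an HTML string
tab = str(soup.table)
#display_html(tab,raw=True)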
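
The cell above depends on the live Wikipedia page, so its result can change over time. As a minimal offline sketch of the same idea, the snippet below pulls the rows out of a small inline table. The `sample_html` string and the `table_to_rows` helper are hypothetical names introduced only for illustration, and it uses Python's built-in `html.parser` rather than the parser chosen in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the rows mirror the output shown in this notebook.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def table_to_rows(html):
    """Return (header, rows) for the first <table> found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped here
            rows.append(cells)
    return header, rows


header, rows = table_to_rows(sample_html)
print(header)
for row in rows:
    print(row)
```

Checking for a missing table before reading it gives a clear error if the page layout ever changes, instead of a confusing failure later in the notebook.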

In [24]:
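#Convert html to Pandas dataframe to enable cleaning and processing of data
from io import StringIO

#wrap the HTML string in StringIO; newer pandas versions deprecate passing literal HTML to read_html
dfs = pd.read_html(StringIO(tab))
df=dfs[0]
df.head()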

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## Step 2 - Data Preprocessing and cleaning

In [25]:
#Drop rows whose borough is 'Not assigned' so that we only process rows with an assigned borough
df1 = df[df.Borough != 'Not assigned']

#Combine neighbourhoods with same Postal Code
df2 = df1.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

#Where the neighbourhood is 'Not assigned', replace it with the name of the borough
df2['Neighbourhood'] = np.where(df2['Neighbourhood'] == 'Not assigned',df2['Borough'], df2['Neighbourhood'])

df2.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
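
To make the three cleaning rules easier to follow, here is a small sketch that applies them to a toy dataframe. The rows in `toy` are hypothetical and were chosen only so that each rule has something to act on; they are not taken from the Wikipedia table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, chosen to trigger every cleaning rule.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1. Keep only rows with an assigned borough.
assigned = toy[toy["Borough"] != "Not assigned"]

# 2. Collapse rows that share a postal code and borough into one comma-separated row.
merged = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Fall back to the borough name where the neighbourhood is 'Not assigned'.
merged["Neighbourhood"] = np.where(
    merged["Neighbourhood"] == "Not assigned",
    merged["Borough"],
    merged["Neighbourhood"],
)

print(merged)
```

Passing `sort=False` to `groupby` keeps the postal codes in their order of first appearance, which is why the cleaned table above still starts at M3A.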


In [26]:
#display shape of the dataframe
df2.shape

(103, 3)
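
The shape only confirms the number of rows and columns. A few extra checks can confirm that the cleaning rules actually held. The sketch below is one possible way to do that; `check_cleaned` and the two sample frames are hypothetical and assume the column names used in this notebook.

```python
import pandas as pd


def check_cleaned(frame):
    """Return a list of problems found in a cleaned postal-code frame (empty if none)."""
    problems = []
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough present")
    if (frame["Neighbourhood"] == "Not assigned").any():
        problems.append("unassigned neighbourhood present")
    if frame["Postal Code"].duplicated().any():
        problems.append("duplicate postal code")
    return problems


# Hypothetical sample frames: one clean, one that breaks two of the rules.
good = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
bad = pd.DataFrame({
    "Postal Code": ["M3A", "M3A"],
    "Borough": ["North York", "Not assigned"],
    "Neighbourhood": ["Parkwoods", "Parkwoods"],
})

print(check_cleaned(good))
print(check_cleaned(bad))
```

In the notebook, the same function could be called as `check_cleaned(df2)` after the cleaning cell.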