# SEGMENTING AND CLUSTERING NEIGHBORHOODS IN CANADA
-----------------------------------
## Created by Zohair Hashmi

### The code in this notebook extracts postal code data from a Wikipedia page.

In [34]:
#importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd


The following code uses the wikipedia library to fetch the HTML of the Wikipedia page. pandas then parses the first table on that page into a dataframe.

In [35]:
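# The wikipedia library must be installed before it can be imported
# (uncomment the following line to install it)
#!pip install wikipedia

from io import StringIO

import wikipedia as wp

# Get the HTML source of the page as a string
html = wp.page("List of postal codes of Canada: M").html()
# read_html returns a list of every table found; the first one holds the postal codes.
# The HTML is wrapped in StringIO because newer pandas versions deprecate literal HTML input.
df = pd.read_html(StringIO(html))[0]
# Save the raw table without a header row
df.to_csv('postal_codes_of_Canada.csv', header=False, index=False)
print(df.shape)
df.head()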

(288, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


The resulting dataframe is now cleaned and reshaped into the desired form.

The following code performs three steps:
1. Rows whose Borough is 'Not assigned' are removed.
2. For rows whose Neighbourhood is 'Not assigned', the Neighbourhood is replaced by the corresponding Borough value.
3. The data is grouped by Postcode and Borough, so that neighbourhoods sharing a postcode are joined into one comma-separated row.
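
The three steps can be sketched on a small hand-made dataframe before running them on the real table. This is an illustrative sketch only: the rows in `toy` are written by hand in the format of the scraped table, and `clean_postcodes` is a hypothetical helper name.

```python
import pandas as pd

# Hand-made rows in the same format as the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})


def clean_postcodes(df):
    # Step 1: drop rows without an assigned borough (copy to avoid touching the input)
    out = df[df["Borough"] != "Not assigned"].copy()
    # Step 2: fall back to the borough name where the neighbourhood is not assigned
    missing = out["Neighbourhood"] == "Not assigned"
    out.loc[missing, "Neighbourhood"] = out.loc[missing, "Borough"]
    # Step 3: one row per (Postcode, Borough), neighbourhoods joined with commas
    return (
        out.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(",".join)
        .reset_index()
    )


cleaned = clean_postcodes(toy)
print(cleaned)
```

Selecting the `Neighbourhood` column before aggregating makes it explicit which column is joined, and the boolean mask in step 2 touches only the rows that need a replacement.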

In [36]:
# Remove rows whose Borough is 'Not assigned'; copy so later assignments do not act on a view
df = df[df.Borough != 'Not assigned'].copy()
print(df.shape)

# Replace 'Not assigned' neighbourhoods with their respective Borough values
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
df.head()

(211, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [37]:
# Group the dataframe by Postcode and Borough, joining each group's neighbourhoods with commas
df_gpd = df.groupby(['Postcode','Borough'],as_index=False).agg(','.join)
df_gpd.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


After cleaning and grouping, the shape of the resulting dataframe is printed below.

In [38]:
print(df_gpd.shape)

(103, 3)
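
Beyond the shape, a few sanity checks can confirm that the cleaning did what was intended. The sketch below is one possible set of checks, not part of the original notebook: `check_cleaned` is a hypothetical helper, the requirement that each postcode appears exactly once is an assumption about the desired result, and the `bad` frame is made up to trigger the checks.

```python
import pandas as pd


def check_cleaned(df):
    """Return a list of human-readable problems found in a cleaned dataframe."""
    problems = []
    if (df["Borough"] == "Not assigned").any():
        problems.append("some rows still have an unassigned borough")
    if (df["Neighbourhood"] == "Not assigned").any():
        problems.append("some rows still have an unassigned neighbourhood")
    # Assumption: after grouping, each postcode should appear exactly once
    if df["Postcode"].duplicated().any():
        problems.append("some postcodes appear more than once")
    return problems


# Two rows copied from the grouped output shown earlier
good = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

# Made-up frame that violates two of the checks
bad = pd.DataFrame({
    "Postcode": ["M1B", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Not assigned"],
})

print(check_cleaned(good))
print(check_cleaned(bad))
```

In the notebook, the same function could be called as `check_cleaned(df_gpd)`; an empty list would indicate that none of these problems were found.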
