# SEGMENTING AND CLUSTERING NEIGHBORHOODS IN CANADA
-----------------------------------
## PART 1 : Scraping the postal code table from Wikipedia

### The code in this notebook extracts postal code data from a Wikipedia page.

In [7]:
#importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd


In the following code, the `wikipedia` library fetches the HTML of the Wikipedia page, and `pd.read_html` converts the first table on that page into a dataframe.

In [8]:
# install the wikipedia library before importing it
!pip install wikipedia

import wikipedia as wp
 
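# get the HTML source of the page and parse its first table into a dataframe
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html)[0]
# save a copy, writing the column names as the CSV header row
df.to_csv('postal_codes_of_Canada.csv', header=True, index=False)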
print(df.shape)
df.head()

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
(288, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
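
The cell above relies on the `wikipedia` library plus `pd.read_html`, so the `requests` and `BeautifulSoup` imports go unused. As a minimal sketch of how BeautifulSoup could do the same table-to-dataframe conversion, the example below parses a small inline HTML snippet. `SAMPLE_HTML` and `table_to_dataframe` are hypothetical names introduced for illustration, and the snippet is only a stand-in for the real page source, not the notebook's actual method.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML of the Wikipedia page (not the real source).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> found in `html` into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra dependency
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Collect the text of every header or data cell in this row.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    # The first row holds the column names; the rest are data rows.
    header, *body = rows
    return pd.DataFrame(body, columns=header)


sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

With a live page, the `.text` of a `requests.get(...)` response could presumably be passed in place of `SAMPLE_HTML`, assuming the page still exposes the data as a plain HTML table.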


The dataframe is now cleaned and reshaped into the desired form.

The following code performs three steps:
1. Rows with a 'Not assigned' borough are removed.
2. Neighbourhoods marked 'Not assigned' are replaced by their corresponding borough values.
3. The data is grouped by postcode and borough.

In [9]:
# remove rows with Borough = 'Not assigned' (copy to avoid SettingWithCopyWarning)
df = df[df.Borough != 'Not assigned'].copy()
print(df.shape)

# replace 'Not assigned' neighbourhoods with their respective Borough values
df['Neighbourhood'] = df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough'])
df.head()

(211, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


In [10]:
# group by Postcode and Borough, joining the neighbourhood names with commas
df_gpd = df.groupby(['Postcode','Borough'],as_index=False).agg(','.join)
df_gpd.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


After cleaning, the shape of the grouped dataframe is printed below.

In [11]:
print(df_gpd.shape)

(103, 3)
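
As a recap, the three cleaning steps performed above could be collected into a single function and checked on a small table. This is only a sketch: `clean_postal_codes` and the `toy` dataframe are hypothetical names and data made up for illustration, and the function assumes the same `Postcode`, `Borough`, and `Neighbourhood` columns as the scraped table.

```python
import pandas as pd


def clean_postal_codes(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a raw postal code table (hypothetical helper)."""
    # Step 1: drop rows whose borough is not assigned.
    assigned = raw[raw["Borough"] != "Not assigned"].copy()
    # Step 2: fall back to the borough name where the neighbourhood is not assigned.
    missing = assigned["Neighbourhood"] == "Not assigned"
    assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]
    # Step 3: one row per (postcode, borough), neighbourhoods joined with commas.
    return assigned.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"].agg(",".join)


# Hypothetical toy input covering all three cases.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

cleaned = clean_postal_codes(toy)
print(cleaned)
```

After grouping, each postcode should appear exactly once, which `cleaned["Postcode"].is_unique` can confirm.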


---------------------------------

## PART 2 : Getting latitude and longitude coordinates of each neighbourhood


In [23]:
# load the latitude and longitude of each postal code
cord = pd.read_csv('https://cocl.us/Geospatial_data')
cord.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [37]:
# attach coordinates to each postcode, then drop the duplicate key column
pc_df = pd.merge(left=df_gpd, right=cord, left_on='Postcode', right_on='Postal Code')
pc_df = pc_df.drop(columns=['Postal Code'])
print(pc_df.shape)
pc_df.head()

(103, 5)


| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
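
The merge above is an inner join, so a postcode with no coordinates would be dropped silently. Here all 103 rows survive, but a sketch of how such a merge could be checked more explicitly is shown below, using the `validate` and `indicator` options of `pd.merge`. The `left` and `right` frames are small hypothetical stand-ins, and the postcode `M9Z` is made up to demonstrate an unmatched row.

```python
import pandas as pd

# Hypothetical stand-ins for the grouped postcode table and the coordinates table.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

checked = pd.merge(
    left,
    right,
    how="left",              # keep every postcode, matched or not
    left_on="Postcode",
    right_on="Postal Code",
    validate="one_to_one",   # raise if a key is duplicated on either side
    indicator=True,          # add a "_merge" column describing each row's origin
)

# Postcodes that found no coordinates are flagged as "left_only".
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postcode received coordinates, which is consistent with the row count staying at 103 in the merge above.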
