## Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and transforms it into a pandas dataframe.

In [1]:
# import libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

### 1. Scrape page

In [2]:
# connect to the page and get the HTML content
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
doc=requests.get(url)
doc.raise_for_status()  # stop early if the request failed

### 2. Convert to BeautifulSoup Object and get table
Use the BeautifulSoup package to parse the page, then load the table of postal codes into a pandas dataframe.

In [3]:
# convert to BeautifulSoup Object
html_content=BeautifulSoup(doc.content,'lxml')
print(html_content.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":862527922,"wgRevisionId":862527922,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

In [4]:
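# get the first table on the page and transform it into a pandas dataframe
# (wrapping the HTML in StringIO avoids the deprecation warning newer pandas raises for literal HTML strings)
from io import StringIO
table=html_content.find_all('table')[0]
df=pd.read_html(StringIO(str(table)))[0]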
df

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
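
The table can also be read cell by cell with BeautifulSoup alone, which makes it explicit where the header row comes from. The following is a minimal sketch, not the method used in this notebook: `SAMPLE_HTML` is a hypothetical miniature of the Wikipedia table so the example runs offline, and `table_to_dataframe` is an illustrative helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the sketch runs without a network call.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Turn the first <table> in `html` into a DataFrame, using its first row as the header."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, so lxml is not required here
    rows = soup.find("table").find_all("tr")
    # Collect the stripped text of every header or data cell, one list per table row.
    cells = [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in rows
    ]
    return pd.DataFrame(cells[1:], columns=cells[0])

sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Because the header row is consumed while the dataframe is built, this variant would not need the separate header fix in the data wrangling step.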


### 3. Data wrangling

In [5]:
# Promote the first row to the header. The dataframe then has three columns: Postcode, Borough, and Neighbourhood
df.columns=df.loc[0]
df.drop(0,inplace=True)
df

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned
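
The header promotion can be tried on a toy frame to see what it does to the rows and the column index. This is a sketch under assumptions: `raw` and `tidy` are hypothetical names, and the data is a three-row stand-in for the scraped table. It assigns a plain list rather than a row Series, since a Series may carry its row label over as the name of the column index.

```python
import pandas as pd

# Toy stand-in for the raw table, where the column names arrive as row 0.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
])

tidy = raw.iloc[1:].copy()            # every row except the header row
tidy.columns = raw.iloc[0].tolist()   # plain list, so the column index gets no name
tidy = tidy.reset_index(drop=True)    # renumber the remaining rows from 0

print(tidy)
```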


In [6]:
# Only process the cells that have an assigned borough; ignore cells whose borough is "Not assigned".

# Replace "Not assigned" with NaN in column "Borough"
# (assigning the result back avoids calling inplace=True on a column selection)
df_clean=df.copy()
df_clean['Borough']=df_clean['Borough'].replace("Not assigned", np.nan)
df_clean.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,,Not assigned
2,M2A,,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


In [7]:
# Drop rows that contain NaN (here, the rows without an assigned borough)
df_clean.dropna(inplace=True)
df_clean.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
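
The replace-then-drop sequence can also be written as a single boolean filter. The sketch below uses a hypothetical toy frame (`toy`) rather than the scraped data. One difference worth noting: the filter inspects only the `Borough` column, whereas `dropna()` without arguments removes rows with NaN in any column.

```python
import pandas as pd

# Toy data with the same three columns as the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows whose borough is assigned, then renumber the index.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

Rows with an assigned borough but an unassigned neighbourhood, such as M7A here, survive the filter and are handled in the next step.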


In [8]:
# If a cell has a borough but a "Not assigned" neighbourhood, the neighbourhood is set to the borough.
# mask() swaps in the row's own Borough wherever the condition is True
df_clean['Neighbourhood']=df_clean['Neighbourhood'].mask(df_clean['Neighbourhood']=="Not assigned", df_clean['Borough'])
df_clean

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Queen's Park
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern


In [9]:
# merge the neighbourhoods that share a Postcode into one comma-separated row
df_group=df_clean.groupby(['Postcode','Borough']).aggregate(lambda x:', '.join(x))
df_group=df_group.reset_index()  # keep Postcode and Borough as regular columns
df_group

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [10]:
# number of rows in df_group
df_group.shape[0]

103