# Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the following Wikipedia page,

> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

to extract the table of postal codes and load it into a pandas DataFrame.

#### Let's get started!

First, we have to import the libraries:

In [1]:
#import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Now we scrape the Wikipedia table using the requests and BeautifulSoup libraries.

In [2]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki,'lxml')
table = soup.find("table",{"class":"wikitable sortable"})

In [3]:
#table #to see its content

#### We can now create a data frame by looping through the BeautifulSoup table.

I recommend inspecting the contents of the table first (uncomment the line in the previous cell) to better understand this process.
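
To make the table structure concrete, here is a minimal sketch that runs the same kind of loop over a tiny hand-written table. The `sample_html` string and its rows are made up for illustration (they are not the real page content), and the built-in `html.parser` is used so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, written by hand for this sketch
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

rows = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row holds <th> cells only, so it yields no <td> and is skipped
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Each data row becomes a list of three cleaned strings, which is the information the loop in the next cell collects into the three column lists.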

In [4]:
# Create the data frame from the rows of the scraped table

columns = ['Postal Code', 'Borough', 'Neighborhood']
p = []  # postal codes
b = []  # boroughs
n = []  # neighborhoods

for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Header rows contain no <td> cells, so only data rows pass this check
    if len(cells) == 3:
        # Take the first text node of each cell and trim surrounding whitespace
        p.append(cells[0].find(string=True).strip())
        b.append(cells[1].find(string=True).strip())
        n.append(cells[2].find(string=True).strip())

df = pd.DataFrame({'Postal Code': p, 'Borough': b, 'Neighborhood': n}, columns=columns)
df.head(10)

,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### Let's clean our data!
We only need the rows that have an assigned borough, so we can ignore the rows whose borough is 'Not assigned'.

To do that, let's keep only the rows where the Borough is not equal to 'Not assigned':

In [6]:
ng_tnt=df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
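
As a small illustration of what this filter does, the sketch below applies the same boolean mask to a made-up frame. The `toy` data and the names `mask`, `kept`, and `renumbered` are assumptions for this example only.

```python
import pandas as pd

# Made-up sample rows, not the scraped data
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M8A', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'Not assigned', 'Etobicoke'],
})

mask = toy['Borough'] != 'Not assigned'    # boolean Series, one value per row
kept = toy[mask]                           # surviving rows keep their old labels
renumbered = kept.reset_index(drop=True)   # labels restart at 0, old index dropped

print(list(kept.index))
print(list(renumbered.index))
```

Without `reset_index(drop=True)` the filtered frame would keep gaps in its index, which can be confusing when rows are later addressed by position.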

More than one neighborhood can exist in one postal code area. Luckily, the table extracted from Wikipedia already lists all neighborhoods of a postal code in a single row, so it is only necessary to check that the postal codes are grouped correctly. If the number of unique values in the Postal Code column equals the number of rows, every postal code appears exactly once and we can assume the grouping is correct.

In [7]:
ng_tnt['Postal Code'].unique().shape[0]==ng_tnt['Postal Code'].shape[0]

True
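
If this check had returned False, for example because the table listed one neighborhood per row, the rows could be merged per postal code. The following is a minimal sketch on made-up rows, not a step this notebook needs. The `toy` frame is hypothetical, and the sketch assumes each postal code belongs to a single borough.

```python
import pandas as pd

# Made-up rows where one postal code appears twice
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Equivalent to comparing the number of unique values with the number of rows
print(toy['Postal Code'].is_unique)

# Join the neighborhoods of each postal code into one comma-separated string
grouped = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(grouped)
```

After grouping, the same uniqueness check can be repeated on `grouped` to confirm that each postal code now occupies exactly one row.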

#### Great! Now let's confirm that no neighborhood has a 'Not assigned' value:

In [8]:
ng_tnt[ng_tnt['Neighborhood'] == 'Not assigned'].shape[0]

0
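
Had the count been non-zero, one possible fix would be to reuse the borough name as the neighborhood. That policy is an assumption made here for illustration, not something taken from this notebook, and the `toy` rows below are made up.

```python
import pandas as pd

# Made-up rows: the first one has a borough but no neighborhood
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

missing = toy['Neighborhood'] == 'Not assigned'

# Assumed policy: fall back to the borough name where the neighborhood is missing
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy)
```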

And get the shape of the grouped data frame:

In [9]:
ng_tnt.shape

(103, 3)