# Segmenting and Clustering Neighborhoods in Toronto

This notebook builds the code to scrape the following Wikipedia page,

> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

to obtain the data in its table of postal codes and load it into a pandas data frame.

#### Let's get started!

First, we have to import the libraries:

In [21]:
#import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Now we download the Wikipedia page with requests and locate the postal code table with BeautifulSoup.

In [82]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki,'lxml')
table = soup.find("table",{"class":"wikitable sortable"})
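
The lookup above matches the class attribute as the exact string "wikitable sortable". As a minimal sketch, assuming the page could list its classes in a different order or add extra ones, a CSS selector matches each class independently. The HTML snippet is a made-up stand-in for the real page, and `html.parser` is used here only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia HTML).
sample_html = """
<table class="sortable wikitable jquery-tablesorter">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string class lookup: the class order differs here, so nothing is found.
exact_match = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches any table that carries both classes, in any order.
css_match = sample_soup.select_one("table.wikitable.sortable")

print(exact_match is None)
print(css_match is not None)
```

Checking the result for `None` before looping over rows gives a clearer failure than an `AttributeError` further down if the page layout ever changes.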

In [23]:
# table  # uncomment to inspect the parsed table

#### We can now create a data frame by looping through the rows of the BeautifulSoup table.

Inspecting the contents of the table first makes this process easier to follow.

In [84]:
# Create the data frame
columns = ['Postal Code', 'Borough', 'Neighborhood']
rows = []

for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Data rows have exactly three <td> cells; header rows use <th> and are skipped
    if len(cells) == 3:
        # Take the text of each cell and trim the surrounding whitespace
        rows.append([cell.get_text().strip() for cell in cells])

df = pd.DataFrame(rows, columns=columns)
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
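
The row loop can be tried without any network access. The following is a small sketch on a hand-written table, assuming the same three-column layout; `table_to_frame` and the sample HTML are hypothetical names and data introduced only for this illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (not the real page content).
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
  <tr><td>M3A </td><td><a href="#">North York</a> </td><td>Parkwoods </td></tr>
</table>
"""

def table_to_frame(table):
    """Collect the text of every three-cell row into a DataFrame."""
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
            records.append([cell.get_text().strip() for cell in cells])
    return pd.DataFrame(records, columns=["Postal Code", "Borough", "Neighborhood"])

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_df = table_to_frame(sample_table)
print(sample_df)
```

Collecting whole rows in one list keeps the three values of a row together, which avoids the columns drifting out of step if one cell were ever skipped.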


#### Let's clean our data!
We only need the rows that have an assigned borough, so we keep only the rows where Borough is not equal to 'Not assigned' and reset the index:

In [85]:
ng_tnt=df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
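
To see what the filter does, here is a minimal sketch on four rows in the same layout. The name `toy` and the reduced row set are assumptions for illustration; the comparison produces a boolean mask, and indexing with that mask keeps only the rows marked True.

```python
import pandas as pd

# Four sample rows in the same layout as the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

mask = toy["Borough"] != "Not assigned"       # boolean Series, one entry per row
assigned = toy[mask].reset_index(drop=True)   # drop=True discards the old index labels

print(mask.tolist())
print(assigned)
```

Without `reset_index(drop=True)` the surviving rows would keep their original labels (2 and 3 here), which matters for any later step that relies on a contiguous index.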

More than one neighborhood can exist in one postal code area, so we combine the neighborhoods of each postal code into a single comma-separated string. Joining inside the groupby keeps every neighborhood list attached to its own postal code, and sort=False preserves the order of the Wikipedia table:

In [86]:
ng_group = (
    ng_tnt.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
          .agg(', '.join)
          .reset_index()
)
ng_group.head(10)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
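
The grouping step is easiest to see on data where a postal code really does span several rows. This is a minimal sketch, assuming a hypothetical layout with one row per neighborhood; the frame `toy` is illustrative and is not the scraped data.

```python
import pandas as pd

# Hypothetical one-row-per-neighborhood layout.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

grouped = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)      # join the neighborhood names within each group
       .reset_index()       # turn the group keys back into ordinary columns
)
print(grouped)
```

Because the aggregation happens inside the groupby, each joined string stays on the same row as its postal code and borough; assigning separately computed columns by position would not guarantee that once the keys are sorted.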


Finally, let's check the shape of the grouped data frame:

In [87]:
ng_group.shape

(103, 3)
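
The shape is a (rows, columns) tuple, so the row count equals the number of postal codes only if every code appears exactly once. A short sketch of that sanity check follows, using a toy frame; `toy_group` is a hypothetical name and holds just three sample rows.

```python
import pandas as pd

# Toy grouped frame (three sample rows, not the full scraped result).
toy_group = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

n_rows, n_cols = toy_group.shape                     # unpack (rows, columns)
codes_unique = toy_group["Postal Code"].is_unique    # True when no code repeats

print(n_rows, n_cols, codes_unique)
```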