# Segmenting and Clustering Neighborhoods in Toronto

### In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto.

#### Part 1: Create a dataframe by scraping the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [37]:
#import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
#webpage we want to scrape
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
#download the html content
results=requests.get(url)

In [4]:
#check that the HTML was downloaded properly; a successful request returns status code 200
results.status_code

200
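
Inspecting `status_code` by eye works in a notebook, but the same check can be made to fail loudly when the cell is rerun later. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout; neither is part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the raw page bytes, or raise on failure."""
    # timeout (in seconds) is an assumed value; without it the call can hang
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then replace the separate download and status-check cells, and the returned bytes can be passed to `BeautifulSoup` as before.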

In [5]:
#parsing
soup = BeautifulSoup(results.content,'html.parser')

In [6]:
#extract the table: look for the 'table' tag, either on the website (right-click > Inspect) or in the soup variable we just created
#in my case I got: <table class="wikitable sortable">

#the tag is 'table' and the attribute is class (passed to BeautifulSoup as class_)
stat_table = soup.find_all('table',class_='wikitable sortable')

In [7]:
#check how many tables matched; there is only one, so it is the one we are looking for
len(stat_table)

1

In [8]:
#checking the type of stat_table
type(stat_table)

bs4.element.ResultSet

In [9]:
#find_all returns a ResultSet (a list-like object), but we need a single Tag, so take the first element
stat_table = stat_table[0]
type(stat_table)

bs4.element.Tag
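
The difference between the two return types can be seen on a small example. This is a sketch on a made-up HTML string (the markup below is an assumption, not the real Wikipedia page): `find_all` returns a list-like `ResultSet`, while `find` returns the first matching `Tag` directly, or `None` if nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<table class="wikitable sortable"><tr><th>Postcode</th></tr></table>
<table class="other"><tr><th>Ignored</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all: every match, wrapped in a list-like ResultSet
matches = soup.find_all("table", class_="wikitable sortable")
# find: only the first match, as a Tag (or None when there is no match)
first = soup.find("table", class_="wikitable sortable")

print(type(matches).__name__, len(matches))
print(type(first).__name__, first.name)
```

Using `find` would skip the indexing step, at the cost of not seeing how many tables matched.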

In [10]:
#creating dataframe with the stat_table 
#in the html code 'tr' represents the rows, 'th' the headers and 'td' the cells of the table

#create lists to store the table
data = []
columns = []

#filling every row
for i,row in enumerate(stat_table.find_all('tr')):
    section = []
    #filling every cell
    for cell in row.find_all(['th','td']):
        section.append(cell.text.rstrip())
    
    #make first row of data (index=0) the header
    if i == 0:
        columns = section
    else:
        data.append(section)

#convert list into pandas DataFrame
df_can = pd.DataFrame(data = data,columns = columns)
df_can.head()

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
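
To see what the row loop does without downloading anything, the same idea can be tried on a tiny table. This is a sketch under stated assumptions: the HTML string is invented for illustration, and the nested list comprehension is just a compact restatement of the loop above, not a different method. The trailing newlines inside the last cell of each row mimic why `rstrip()` is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Invented sample markup; the last cell of each row ends with a newline
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# One inner list per <tr>; each cell's text has trailing whitespace removed
rows = [
    [cell.get_text().rstrip() for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# First row holds the headers, the remaining rows hold the data
header, body = rows[0], rows[1:]
toy = pd.DataFrame(body, columns=header)
print(toy)
```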


In [11]:
#dropping any Borough that is not assigned
df_can = df_can[df_can['Borough'] != 'Not assigned']
df_can.reset_index(drop=True,inplace=True)
df_can.head()

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
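
The reason for calling `reset_index` after the filter is easier to see on a toy frame. A minimal sketch, with invented sample rows: boolean filtering keeps the original index labels, so the surviving rows would otherwise start at 1 instead of 0.

```python
import pandas as pd

# Invented sample rows for illustration
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# Boolean mask: True for the rows we want to keep
kept = toy[toy["Borough"] != "Not assigned"]
labels_before = kept.index.tolist()  # original labels survive the filter

# drop=True discards the old labels instead of adding them as a column
kept = kept.reset_index(drop=True)
labels_after = kept.index.tolist()

print(labels_before, labels_after)
```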


In [29]:
#Rows that share a postcode are combined into one row, with the neighborhoods separated by a comma, using .groupby():
df_grouped = df_can.groupby(['Postcode', 'Borough'], as_index=False, sort=False).agg(','.join)
df_grouped.head()

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned
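
The grouping step can be checked on a handful of rows. This is a sketch with invented sample data: `sort=False` keeps the groups in order of first appearance, `as_index=False` keeps `Postcode` and `Borough` as ordinary columns, and `','.join` is applied to the neighbourhood strings within each group.

```python
import pandas as pd

# Invented sample rows; M5A appears twice on purpose
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Parkwoods"],
})

merged = toy.groupby(["Postcode", "Borough"], as_index=False, sort=False).agg(",".join)
print(merged)
```

Without `sort=False`, the groups would come back sorted by postcode rather than in the order they appear on the page.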


In [39]:
#If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name:
df_grouped['Neighbourhood'] = np.where(df_grouped['Neighbourhood'] == 'Not assigned', df_grouped['Borough'], df_grouped['Neighbourhood'])
df_grouped.head(10)

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
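
The replacement rule can be verified on two rows. A minimal sketch with invented sample data: `np.where(condition, a, b)` picks from `a` where the condition is true and from `b` otherwise. The pandas method `Series.mask` is shown alongside as an equivalent alternative; it is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Invented sample rows for illustration
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

missing = toy["Neighbourhood"] == "Not assigned"

# np.where: take Borough where the neighbourhood is missing, else keep it
via_where = np.where(missing, toy["Borough"], toy["Neighbourhood"])

# Series.mask: replace values where the condition is True
via_mask = toy["Neighbourhood"].mask(missing, toy["Borough"])

print(list(via_where))
print(via_mask.tolist())
```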


In [40]:
#shape of the dataframe
df_grouped.shape

(103, 3)