# Segmenting and Clustering Neighborhoods in Toronto

## Part 1:

**Objective:** Write code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts its table of postal codes, and loads the data into a pandas dataframe.

In [1]:
# !pip3 install beautifulsoup4
# !pip3 install lxml
# !pip3 install requests

Importing the needed libraries:

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Assigning the target web page to a variable:

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Using the <code>requests</code> library to get the source code of the target web page:

In [4]:
# Keep the response in its own variable so that `url` still holds the address
response = requests.get(url)
src = response.text
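
The request above assumes the page is reachable and returns a successful status. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook:

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML source of `url` (hypothetical helper)."""
    # A timeout keeps the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `src = fetch_html(url)`.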

Creating a <code>BeautifulSoup</code> object:

In [5]:
soup = BeautifulSoup(src, 'lxml')

Extracting the first table on the page, which holds the postal codes:

In [6]:
table = soup.find_all('table')[0]

Reading the HTML table with <code>pd.read_html</code>, which returns a <code>list</code> of <code>DataFrame</code> objects, and keeping the first one:

In [7]:
df = pd.read_html(str(table))[0]
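
To illustrate why the `[0]` index is needed, here is a small sketch that runs `pd.read_html` on an inline HTML snippet containing two tables. The snippet is made-up sample data, not the Wikipedia page. Wrapping the markup in `StringIO` is a precaution: recent pandas releases deprecate passing a literal HTML string directly. Like the cell above, this needs an HTML parser such as `lxml` to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up sample markup with two tables (illustrative only).
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
first = tables[0]  # the postal-code table is the first one in this sample
```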

In [8]:
df.head()

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Dropping the rows where the **Borough** is _Not assigned_:

In [9]:
df = df[~df.Borough.str.startswith('Not')]
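
The filter above relies on a boolean mask negated with `~`. The sketch below contrasts the prefix test with an exact comparison. The sample rows are made up, and the borough name "Notting Hill" is a hypothetical value chosen only to show that a prefix match would also drop any name that happens to begin with "Not":

```python
import pandas as pd

# Made-up rows; "Notting Hill" is a hypothetical borough used for illustration.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "Notting Hill"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Example"],
})

# Prefix test, as in the cell above: drops every borough starting with "Not".
by_prefix = sample[~sample["Borough"].str.startswith("Not")]

# Exact comparison: drops only the literal placeholder value.
by_exact = sample[sample["Borough"] != "Not assigned"]
```

For the boroughs shown in this notebook's outputs the two approaches should agree; the exact comparison is simply the stricter choice.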

In [10]:
df.head()

,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Grouping the rows that share the same **Postcode** and joining their **Neighbourhood** values into one comma-separated string:

In [11]:
df = df.groupby(['Postcode', 'Borough'], sort=False).agg(lambda x: ', '.join(x))
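
As a small worked example of this grouping step, the sketch below applies the same `groupby` and `agg` call to three made-up rows, two of which share a postcode. `sort=False` keeps the groups in order of first appearance, and the grouping keys become a two-level index on the result:

```python
import pandas as pd

# Made-up sample rows mirroring the structure of the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# Each group's neighbourhood names are joined into one comma-separated string.
grouped = sample.groupby(["Postcode", "Borough"], sort=False).agg(
    lambda names: ", ".join(names)
)
```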

In [12]:
df.head()

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,"Lawrence Heights, Lawrence Manor"
M7A,Downtown Toronto,Queen's Park


Resetting the index:

In [13]:
df.reset_index(level=['Postcode','Borough'], inplace=True)
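
`reset_index` moves the `Postcode` and `Borough` index levels back into ordinary columns. As a possible alternative, and not the approach used in this notebook, passing `as_index=False` to `groupby` should give the same flat layout in one step. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up sample rows mirroring the structure of the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})
keys = ["Postcode", "Borough"]

# Two steps: group (keys become the index), then move the keys back to columns.
two_step = (
    sample.groupby(keys, sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# One step: as_index=False keeps the keys as regular columns.
one_step = sample.groupby(keys, sort=False, as_index=False).agg(
    {"Neighbourhood": ", ".join}
)
```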

In [14]:
df.head()

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


Checking the shape of the dataframe:

In [15]:
df.shape

(103, 3)
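
Beyond the shape, a few quick checks can confirm that the cleaning steps did what was intended. The sketch below is one possible set of checks. The helper name `find_problems` is hypothetical, and the rules it tests (expected columns, unique postcodes, no remaining placeholder boroughs) are taken from the steps in this part:

```python
import pandas as pd


def find_problems(frame: pd.DataFrame) -> list:
    """Return a list of problem descriptions (hypothetical helper)."""
    problems = []
    expected = ["Postcode", "Borough", "Neighbourhood"]
    if list(frame.columns) != expected:
        problems.append("unexpected columns")
        return problems  # the remaining checks need these columns
    if not frame["Postcode"].is_unique:
        problems.append("duplicate postcodes remain after grouping")
    if frame["Borough"].eq("Not assigned").any():
        problems.append("unassigned boroughs were not dropped")
    return problems


# Made-up rows: one clean frame and one with a repeated postcode.
clean = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
duplicated = pd.concat([clean, clean.iloc[[0]]], ignore_index=True)
```

Calling `find_problems(df)` on the final dataframe should return an empty list if every step ran as intended.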