# Segmenting and Clustering in Neighborhoods of Toronto - 1
## Web Scraping using BeautifulSoup Package

In this assignment, the list of Canadian postal codes beginning with M is extracted from a Wikipedia page. The scraping is done with the Beautiful Soup package.

Importing the required packages

In [1]:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

Creating a Beautiful Soup object with the lxml parser. Through this object, all the tags in the HTML of the webpage can be accessed.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup
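To illustrate how tags are reached through the soup object without a network call, here is a minimal sketch. The inline HTML string and the names `sample_html`, `sample_soup`, `page_title`, `first_table` and `cell_texts` are made up for the example, and it uses Python's built-in `html.parser` instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page.
sample_html = """
<html><head><title>Sample page</title></head>
<body><table class="wikitable">
<tr><th>Postal code</th><th>Borough</th></tr>
<tr><td>M3A</td><td>North York</td></tr>
</table></body></html>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first tag with that name.
page_title = sample_soup.title.get_text()

# find() can also filter by CSS class.
first_table = sample_soup.find("table", class_="wikitable")

# Collect the text of every data cell in that table.
cell_texts = [td.get_text(strip=True) for td in first_table.find_all("td")]

print(page_title)
print(cell_texts)
```

Attribute access such as `sample_soup.title` is shorthand for `sample_soup.find("title")`, so it only ever returns the first match.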

In [3]:
#Title of the webpage

title = soup.title
print(title)

<title>List of postal codes of Canada: M - Wikipedia</title>


In [4]:
#accessing the first <table> tag of the html code using the soup object.
#It holds all the rows and columns of the postal code data in html format

table=soup.table
print(table)

<table class="wikitable">
<tbody><tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>
<tr>
<td>M4B
</td>
<td>East York
<

Defining a function to extract the data from the table coded in html

In [5]:
def tableDataText(table):
    """Return the table as a list of rows, each row a list of stripped cell texts."""
    rows = []
    trs = table.find_all('tr')
    header_row = [th.get_text(strip=True) for th in trs[0].find_all('th')]  # header cells
    if header_row:  # if there is a header row, include it first
        rows.append(header_row)
        trs = trs[1:]
    for tr in trs:  # every remaining table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data cells
    return rows

In [6]:
data=tableDataText(table)

In [7]:
data[0:2]

[['Postal code', 'Borough', 'Neighborhood'], ['M1A', 'Not assigned', '']]
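The same row-by-row extraction can be tried on a small inline table. This is only a sketch: `rows_from_table`, `sample_html`, `sample_table` and `sample_rows` are hypothetical names, the helper is a compact variant that reads `th` and `td` cells in one pass rather than the function defined above, and `html.parser` is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment; the trailing newlines in the cells mimic the Wikipedia markup.
sample_html = """
<table>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def rows_from_table(table):
    """Return one list of stripped cell texts per <tr>, header cells included."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


sample_table = BeautifulSoup(sample_html, "html.parser").table
sample_rows = rows_from_table(sample_table)
print(sample_rows)
```

Because `get_text(strip=True)` removes surrounding whitespace, a cell that contains only a newline comes back as an empty string, which is why the unassigned rows show `''` in the Neighborhood position.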

The data extracted from the html code is a list of lists, with one inner list per table row. This list is converted into a pandas dataframe.

In [8]:
df=pd.DataFrame(data)

df.head()

Unnamed: 0,0,1,2
0,Postal code,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


Since the header row was extracted as the first row of the dataframe, it is promoted to the column names and removed from the data.

In [9]:
headers=df.iloc[0]
df.columns=headers
df=pd.DataFrame(df[1:],columns=headers)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
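An alternative that avoids the separate header step is to hand the first row to the `columns` argument when the dataframe is built. The sketch below uses a small made-up list, `sample_data`, in place of the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for the list of rows returned by the extraction function.
sample_data = [
    ["Postal code", "Borough", "Neighborhood"],
    ["M1A", "Not assigned", ""],
    ["M3A", "North York", "Parkwoods"],
]

# First row becomes the column names, the remaining rows become the data.
sample_df = pd.DataFrame(sample_data[1:], columns=sample_data[0])
print(sample_df)
```

One difference from the approach above: the index here starts at 0, whereas slicing the header row off an existing dataframe leaves the index starting at 1.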


In [10]:
#checking the size of the dataframe

df.shape

(180, 3)

Eliminating the rows which have the "Borough" value "Not assigned"

In [11]:
#finding the index values of those rows

index_values=df[df['Borough']=='Not assigned'].index
index_values

Int64Index([  1,   2,   8,  11,  16,  17,  20,  25,  26,  29,  30,  34,  35,
             36,  38,  39,  43,  44,  45,  52,  53,  54,  61,  62,  63,  70,
             71,  72,  79,  80,  88,  89,  97,  98, 102, 106, 107, 111, 116,
            119, 120, 124, 125, 126, 128, 129, 132, 133, 134, 135, 137, 138,
            141, 142, 146, 147, 150, 151, 155, 156, 159, 160, 162, 163, 164,
            165, 167, 168, 171, 172, 173, 174, 175, 176, 177, 178, 180],
           dtype='int64')

In [12]:
#dropping those rows

df.drop(index_values, inplace=True)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
6,M6A,North York,Lawrence Manor / Lawrence Heights
7,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [14]:
df.shape

(103, 3)

In [15]:
#resetting the index values after dropping the "Not assigned" rows

df.reset_index(drop=True,inplace=True)
df.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Validating if any of the postal codes are repeated in the table

In [16]:
#rows whose postal code already appeared earlier in the table
duplicates=df[df['Postal code'].duplicated()]
if duplicates.empty:
    print("No duplicate postal codes")
else:
    print(duplicates)

No duplicate postal codes
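To see what the check reports when a code does repeat, here is a minimal sketch on a made-up series. The names `sample_codes`, `dup_mask`, `has_duplicates` and `repeated` are hypothetical, and `keep=False` is used so that every occurrence of a repeated code is flagged, not only the later ones.

```python
import pandas as pd

# Hypothetical series in which "M3A" appears twice.
sample_codes = pd.Series(["M3A", "M4A", "M3A"], name="Postal code")

# keep=False marks every member of a duplicated group as True.
dup_mask = sample_codes.duplicated(keep=False)

has_duplicates = bool(dup_mask.any())
repeated = sample_codes[dup_mask].tolist()

print(has_duplicates)
print(repeated)
```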


Checking whether any row has an empty string ('') in the Neighborhood column

In [17]:
df[df['Neighborhood']=='']


Unnamed: 0,Postal code,Borough,Neighborhood
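A comparison against `''` misses cells that hold only whitespace or a missing value. A slightly broader check, sketched on a made-up dataframe (`sample_df`, `blank_mask` and `blank_codes` are hypothetical names), could look like this:

```python
import pandas as pd

# Hypothetical sample: one empty string and one whitespace-only value.
sample_df = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Neighborhood": ["Parkwoods", "", "  "],
})

# Treat missing values as empty, strip whitespace, then compare with the empty string.
blank_mask = sample_df["Neighborhood"].fillna("").str.strip().eq("")
blank_codes = sample_df.loc[blank_mask, "Postal code"].tolist()

print(blank_codes)
```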


In [18]:
df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [19]:
#Size of the dataframe after dropping the not assigned values.

df.shape

(103, 3)

In [20]:
#saving the dataframe to a csv file for use in later assignments.
df.to_csv(r'......\Postal_codes_Canada.csv',index=False)
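To confirm that a file written with `index=False` reads back with the same columns, the round trip can be sketched with an in-memory buffer in place of a file on disk. The names `sample_df`, `buffer` and `restored` are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample of the cleaned data.
sample_df = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Write to an in-memory text buffer instead of a path on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With `index=False` the row index is not written, so the restored dataframe has exactly the three data columns and no extra `Unnamed: 0` column.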