# Scrape Wikipedia Data

Download the Wikipedia page that lists the Canadian postal codes starting with M

In [83]:
!wget -q -O 'postal.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Parse the data using BeautifulSoup

In [84]:
from bs4 import BeautifulSoup

with open('postal.html', encoding='utf-8') as html_doc:   # read the saved page as UTF-8
    soup = BeautifulSoup(html_doc, 'html.parser')

The required data is in the <td> tags

In [85]:
soup('td')

[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M3A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td>, <td>M4A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td>, <td>M5A</td>, <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>, <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td>, <td>M5A</td>, <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>, <td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
 </td>, <td>M6A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights
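
As a minimal sketch of what the call above does, the snippet below parses a small inline fragment instead of the downloaded page. The fragment is a hypothetical sample shaped like the real output. Calling the soup object directly, as in `soup('td')`, is BeautifulSoup's shorthand for `soup.find_all('td')`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row fragment shaped like the Wikipedia table.
html = (
    "<table><tr>"
    "<td>M3A</td>"
    "<td><a href='/wiki/North_York'>North York</a></td>"
    "<td>Parkwoods\n</td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all().
cells = soup("td")
same_cells = soup.find_all("td")

# get_text() drops nested tags such as <a>; strip() removes the trailing newline.
texts = [td.get_text().strip() for td in cells]
print(texts)
```

Stripping each cell matters here because the last cell of every row ends with a newline in the page source.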

Collect the text of every <td> in a list, stopping at the first empty string. Then reshape the flat array into rows of three columns so that it can be loaded into a data frame

In [86]:
import numpy as np

a = []
for tag in soup('td'):
    s = tag.text.strip()
    if s == '':   # stop at the first empty cell
        break
    a.append(s)
    
a = np.array(a)
print(a.shape)
a = a.reshape((-1,3))
print(a.shape)

(867,)
(289, 3)
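
The same collect-then-reshape logic can be sketched with only the standard library, which also makes the "rows of three" assumption explicit. The `CellCollector` class and the inline HTML fragment are hypothetical and serve only as illustration. They are not part of the notebook.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the stripped text of each <td>, stopping at the first empty cell."""

    def __init__(self):
        super().__init__()
        self.cells = []     # finished cell texts
        self._buf = None    # text pieces of the cell being read, or None
        self.done = False   # set once an empty cell has been seen

    def handle_starttag(self, tag, attrs):
        if tag == "td" and not self.done:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            text = "".join(self._buf).strip()
            self._buf = None
            if text == "":
                self.done = True
            else:
                self.cells.append(text)


# Hypothetical fragment: two data rows followed by an empty cell.
html = (
    "<table>"
    "<tr><td>M3A</td><td><a href='#'>North York</a></td><td>Parkwoods\n</td></tr>"
    "<tr><td>M4A</td><td>North York</td><td>Victoria Village\n</td></tr>"
    "<tr><td></td><td>ignored</td></tr>"
    "</table>"
)

parser = CellCollector()
parser.feed(html)
parser.close()

cells = parser.cells
# Fail loudly if the cells do not divide evenly into rows of three.
assert len(cells) % 3 == 0, "cell count is not a multiple of 3"
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]
print(rows)
```

Checking divisibility before grouping plays the same role as `reshape((-1, 3))` above, which raises an error when the element count is not a multiple of three.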


Create a DataFrame

In [87]:
import pandas as pd

# build from the array rather than a dict so that the column order is explicit
df = pd.DataFrame(a, columns=['PostalCode', 'Borough', 'Neighborhood'])
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


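Check for missing data: count the rows where both fields are 'Not assigned', where only the neighborhood is, and where only the borough is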

In [88]:
print(((df.Borough == 'Not assigned') & (df.Neighborhood == 'Not assigned')).sum())
print(((df.Borough != 'Not assigned') & (df.Neighborhood == 'Not assigned')).sum())
print(((df.Borough == 'Not assigned') & (df.Neighborhood != 'Not assigned')).sum())

77
1
0


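Remove the rows where both the borough and the neighborhood are 'Not assigned'. For the one row that has a borough but no neighborhood, use the borough name as the neighborhood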

In [89]:
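# drop rows where both fields are unassigned; copy to avoid SettingWithCopyWarning
df = df[(df.Borough != 'Not assigned') | (df.Neighborhood != 'Not assigned')].copy()

# where only the neighborhood is unassigned, fall back to the borough name
s = df[df.Neighborhood == 'Not assigned'].Borough
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = s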

df.head(10)

|    | PostalCode | Borough | Neighborhood |
|----|------------|---------|--------------|
| 2  | M3A | North York | Parkwoods |
| 3  | M4A | North York | Victoria Village |
| 4  | M5A | Downtown Toronto | Harbourfront |
| 5  | M5A | Downtown Toronto | Regent Park |
| 6  | M6A | North York | Lawrence Heights |
| 7  | M6A | North York | Lawrence Manor |
| 8  | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [90]:
df.shape

(212, 3)
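
The cleaning step can also be tried on a toy frame, which makes both rules easy to verify by eye. This is a minimal sketch: the three rows are a hypothetical sample mirroring the cases counted earlier, and `Series.mask` is used here as an alternative to the `.loc` assignment, not as the notebook's own method.

```python
import pandas as pd

# Hypothetical sample: one fully unassigned row, one complete row,
# and one row with a borough but no neighborhood.
toy = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M3A", "North York", "Parkwoods"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep rows where at least one field is assigned; copy to get an independent frame.
keep = (toy.Borough != "Not assigned") | (toy.Neighborhood != "Not assigned")
cleaned = toy[keep].copy()

# Where the neighborhood is unassigned, substitute the borough name.
cleaned["Neighborhood"] = cleaned.Neighborhood.mask(
    cleaned.Neighborhood == "Not assigned", cleaned.Borough
)

print(cleaned)
```

As in the notebook, the original index labels are kept, so the dropped row leaves a gap in the index.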