A great deal of data is available on the web. Acquiring web data and giving it structure is known as web scraping. BeautifulSoup is a Python package used for web scraping. Here we will scrape a table from a webpage.

In [1]:
import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup
import requests
import matplotlib.pyplot as plt
import urllib.request as urllib2

We will scrape data from Wikipedia. The data consists of Postcode, Borough and Neighbourhood. In the following code we download the page and parse it into a bs4.BeautifulSoup object.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r=requests.get(url)
H=BeautifulSoup(r.content,'html.parser')  # name the parser explicitly to avoid a warning
type(H)

bs4.BeautifulSoup

We first need to select the table that we'd like to scrape. As a webpage can contain multiple tables, we collect every line that opens a table with a class attribute into a list.

In [3]:
htmlpage=urllib2.urlopen(url)
lst=[]
for line in htmlpage:
    line=line.rstrip()
    if re.search(b'table class',line) :
        lst.append(line)

In [4]:
len(lst)

3
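The same inventory of tables can also be taken from the parsed document rather than from the raw lines. The following is a minimal sketch, assuming a small hand-written HTML string (sample_html, a hypothetical stand-in for the downloaded page) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real Wikipedia markup).
sample_html = """
<html><body>
<table class="wikitable sortable"><tr><th>Postcode</th></tr></table>
<table class="navbox"><tr><td>nav</td></tr></table>
<table><tr><td>no class</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the class attribute of every table.
# Tables without a class attribute yield an empty list.
table_classes = [t.get("class", []) for t in soup.find_all("table")]
print(table_classes)
```

BeautifulSoup returns the class attribute as a list of names, so a table with class "wikitable sortable" shows up as two entries.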

In [5]:
table=H.find('table',{'class':'wikitable sortable'})  # attrs must be a dict, not a set
type(table)

bs4.element.Tag

In [6]:
x=lst[0]
print(x) 
extr=re.findall(b'"([^"]*)"',x)
#table=H.find('table',{'class',str(extr).strip("'[]'")})

b'<table class="wikitable sortable">'


After extracting the class name from the tag, we read the table headers and the row values separately.

In [7]:
headers=[header.text for header in table.find_all('th')]
headers

['Postcode', 'Borough', 'Neighbourhood\n']
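The trailing newline in the last header can be removed at extraction time. The sketch below assumes a hand-written header row (row_html, a hypothetical sample) and compares .text with get_text(strip=True).

```python
from bs4 import BeautifulSoup

# Hypothetical header row that mimics the trailing newline seen above.
row_html = (
    "<table><tr><th>Postcode</th><th>Borough</th>"
    "<th>Neighbourhood\n</th></tr></table>"
)
soup = BeautifulSoup(row_html, "html.parser")

# .text keeps surrounding whitespace; get_text(strip=True) removes it.
raw_headers = [th.text for th in soup.find_all("th")]
clean_headers = [th.get_text(strip=True) for th in soup.find_all("th")]

print(raw_headers)
print(clean_headers)
```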

In [8]:
rows=[]
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

In [9]:
df1=pd.DataFrame(rows,columns=headers)
df1.dropna(axis=0,inplace=True)
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,b'M1A',b'Not assigned',b'Not assigned\n'
2,b'M2A',b'Not assigned',b'Not assigned\n'
3,b'M3A',b'North York',b'Parkwoods\n'
4,b'M4A',b'North York',b'Victoria Village\n'
5,b'M5A',b'Downtown Toronto',b'Harbourfront\n'


Every value in the table is prefixed with b because it was encoded to bytes with .encode('utf8') in In [8]. These are bytes objects, not strings, so we decode them back to text.
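The difference between the two types can be seen on a single value. This is a small illustration using only the standard library; the sample string is chosen to resemble a cell from the table.

```python
# A text value as it would come out of a table cell.
text = "Parkwoods\n"

# Encoding turns the str into a bytes object, which prints with a b prefix.
encoded = text.encode("utf8")
print(type(encoded), repr(encoded))

# Decoding reverses the step and gives back the original str.
decoded = encoded.decode("utf-8")
print(type(decoded), repr(decoded))

# strip() then removes the trailing newline.
cleaned = decoded.strip()
print(repr(cleaned))
```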

In [10]:
df1['Borough']=df1['Borough'].str.decode("utf-8")


In [11]:
#df1['Neighbourhood\n'].replace(r'\\n','',regex=True,inplace=True)
#df1["Neighbourhood\n"]=df1["Neighbourhood\n"].apply(lambda x:x.replace('\\n',""))
#'][x.strip('\n') for x in df1.Neighbourhood]
#df1['Neighbourhood'].str.decode("utf-8")


In [12]:
df1['Postcode']=df1['Postcode'].str.decode("utf-8")
df1['Neighbourhood\n']=df1['Neighbourhood\n'].str.decode("utf-8")


In [13]:
df1.columns

Index(['Postcode', 'Borough', 'Neighbourhood\n'], dtype='object')

In [14]:
df1.columns=[i.strip() for i in df1.columns]

In [15]:
df1.columns
df1['Neighbourhood']=[i.strip() for i in df1.Neighbourhood]

In [16]:
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


After decoding, the table is in the required format.

We will now remove all rows whose Borough value is 'Not assigned'.

In [17]:
df1.head()
df2=df1[df1['Borough'] !='Not assigned']
df2.head()
#df2.index

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


We then combine the rows that share a postcode, so that their neighbourhoods are collected into a single list, as shown below.

In [18]:
df3=df2.Neighbourhood.groupby([df2.Postcode,df2.Borough]).apply(list).reset_index()
#df2['Neighbourhood'].groupby([df2.Postcode,df2.Borough]).apply
df3.head(100)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]
5,M1J,Scarborough,[Scarborough Village]
6,M1K,Scarborough,"[East Birchmount Park, Ionview, Kennedy Park]"
7,M1L,Scarborough,"[Clairlea, Golden Mile, Oakridge]"
8,M1M,Scarborough,"[Cliffcrest, Cliffside, Scarborough Village West]"
9,M1N,Scarborough,"[Birch Cliff, Cliffside West]"
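If a comma-separated string is preferred over a Python list in each cell, the grouping can aggregate with a join instead. This is a minimal sketch on a small hand-made frame (sample, a hypothetical subset of the rows above), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical sample with one postcode that spans two neighbourhoods.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postcode and borough, then join the neighbourhood names.
joined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(joined)
```

Keeping plain strings in the column also makes later comparisons such as == 'Not assigned' behave as expected.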


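Next we check the Neighbourhood column for 'Not assigned' values. Where one is found, we replace it with the corresponding value from the Borough column. Because each Neighbourhood entry is now a list, the comparison must be made against the list ['Not assigned'] rather than the plain string.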

In [19]:
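# Comparing the list-valued column with a plain string never matches, so this line changes nothing.
df3.loc[df3.Neighbourhood =='Not assigned','Neighbourhood']=df3.Borough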
#df3.loc[df3['Neighbourhood'] == ['Not assigned']]
#df3_NA=df3[df3.Neighbourhood=='Not assigned']
#df3.Neighbourhood[df3.Neighbourhood==[Not assigned]]=df3.Neighbourhood.replace([Not assigned],df3.Borough)

In [20]:
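def replace(df,col,key,val):
    # Build a boolean mask by comparing each cell (a list) with key
    m=[v==key for v in df[col]]
    # Overwrite the matching cells; a Series val is aligned on the index
    df.loc[m,col]=val

replace(df3,'Neighbourhood',['Not assigned'],df3.Borough)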
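When the Neighbourhood column still holds plain strings (that is, before the rows are grouped into lists), the same substitution can be written with Series.mask. The sketch below assumes two illustrative rows in a hypothetical frame named sample.

```python
import pandas as pd

# Illustrative rows: one with a missing neighbourhood, one complete.
sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the condition is True, take the value from Borough instead.
not_assigned = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(not_assigned, sample["Borough"])
print(sample)
```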


In [21]:
df3

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]
5,M1J,Scarborough,[Scarborough Village]
6,M1K,Scarborough,"[East Birchmount Park, Ionview, Kennedy Park]"
7,M1L,Scarborough,"[Clairlea, Golden Mile, Oakridge]"
8,M1M,Scarborough,"[Cliffcrest, Cliffside, Scarborough Village West]"
9,M1N,Scarborough,"[Birch Cliff, Cliffside West]"


In [22]:
df3.shape

(103, 3)

In [23]:
type(df3)

pandas.core.frame.DataFrame

After running the above code, we get a DataFrame with 103 rows and 3 columns.