Install the web scraping packages

In [17]:
!pip install requests

!pip install beautifulsoup4 

Requirement not upgraded as not directly required: requests in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests)
Requirement not upgraded as not directly required: idna<2.7,>=2.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests)
Requirement not upgraded as not directly required: urllib3<1.23,>=1.21.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests)
Requirement not upgraded as not directly required: certifi>=2017.4.17 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests)
Requirement not upgraded as not directly required: beautifulsoup4 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


Import required libraries

In [2]:
from bs4 import BeautifulSoup
import requests
import urllib.request
import pandas as pd
import numpy as np

Get a local copy of the Wikipedia article

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Download the page and decode the bytes as UTF-8.
with urllib.request.urlopen(url) as req:
    article = req.read().decode('utf-8')
# Save a local copy so that later cells do not depend on the network.
with open('List_postal_codes_Canada.html', 'w', encoding='utf-8') as fo:
    fo.write(article)

Load the HTML file and parse using Beautiful Soup

In [4]:
# Load the saved article, parse it, and collect the sortable <table> elements.
with open('List_postal_codes_Canada.html', encoding='utf-8') as fi:
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')
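
The download and load steps above fetch the page on every run. A minimal sketch, assuming it is acceptable to reuse a previously saved file, is to download only when no local copy exists. The helper name `get_local_copy` is hypothetical and not part of the original notebook, and the UTF-8 encoding is an assumption about the page.

```python
import os
import urllib.request


def get_local_copy(url, path):
    """Return the text stored at path, downloading it from url first if needed.

    Hypothetical helper for illustration; assumes the page is UTF-8 encoded.
    """
    if not os.path.exists(path):
        # No cached file yet: fetch the page once and store it.
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode('utf-8')
        with open(path, 'w', encoding='utf-8') as fo:
            fo.write(html)
    # Always read from disk so that cached and fresh runs behave the same.
    with open(path, encoding='utf-8') as fi:
        return fi.read()
```

Calling `get_local_copy(url, 'List_postal_codes_Canada.html')` would then stand in for both cells, at the cost of never refreshing a stale copy unless the file is deleted.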

Validate the headings

In [7]:
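# Collect the text of every header cell (<th>) in the sortable table.
# The loop variable `table` is reused by the next cell.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
headings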

['Postcode', 'Borough', 'Neighbourhood']

Scan the table rows, extract the cell text, and build the pandas DataFrame

In [8]:
# Build one list of cell texts per data row; rows without <td> cells (the header row) are skipped.
rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue
    rows.append([td.text.strip() for td in tds[:3]])
df_table1 = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighbourhood'])
df_table1

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
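
The extraction logic can be checked without the network on a small inline table. The following is a sketch under stated assumptions: `sample_html` is a hypothetical two-row sample that only mimics the structure of the Wikipedia table, and the column names are taken from the parsed `<th>` cells rather than typed in by hand.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample that mimics the structure of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
# class_ matches any one of the classes listed in the attribute.
sample_table = sample_soup.find('table', class_='sortable')

# Header cells give the column names.
sample_headings = [th.text.strip() for th in sample_table.find_all('th')]

# Data rows are the <tr> elements that contain at least one <td>.
sample_rows = [
    [td.text.strip() for td in tr.find_all('td')]
    for tr in sample_table.find_all('tr')
    if tr.find_all('td')
]
sample_df = pd.DataFrame(sample_rows, columns=sample_headings)
```

Reusing the parsed headings keeps the column names in step with the page, although it would silently pick up any renamed header as well.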


Only process the cells that have an assigned borough: drop every row whose Borough is 'Not assigned'.

In [9]:
df_table1=df_table1[df_table1.Borough != 'Not assigned']
df_table1

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
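
The filtering step can be tried on a small hand-made frame. This is a sketch only: `toy` and its three rows are made up for illustration, and the `.copy()` and `reset_index` calls are optional additions that the notebook itself does not use (it keeps the original row labels, which is why label 9 is missing above).

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean mask: True for rows whose Borough is assigned.
assigned = toy['Borough'] != 'Not assigned'

# .copy() makes the result an independent DataFrame, and
# reset_index(drop=True) renumbers the remaining rows from 0.
toy_assigned = toy[assigned].copy().reset_index(drop=True)
```

Taking an explicit copy is one common way to keep later column assignments from being treated as writes to a slice of the original frame.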


If a cell has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough. For the 9th cell in the table on the Wikipedia page (M7A), both the Borough and the Neighbourhood columns will therefore be Queen's Park.

Search for 'Not assigned' in Neighbourhood and replace it with the value from Borough.


In [16]:
# Work on an explicit copy so the assignment below does not target a slice of another frame.
df_table1 = df_table1.copy()
cond = df_table1['Neighbourhood'] == 'Not assigned'
df_table1['Neighbourhood'] = np.where(cond, df_table1['Borough'], df_table1['Neighbourhood'])
df_table1


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
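
The same replacement can also be written with label-based assignment. The sketch below uses a hypothetical two-row frame (`toy`) and `.loc` with a boolean mask; it is an alternative to the `np.where` call, not the notebook's own method.

```python
import pandas as pd

# Hypothetical two-row example: one normal row, one with a missing neighbourhood.
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Rows whose neighbourhood is the placeholder text.
missing = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough into the neighbourhood for exactly those rows.
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
```

Only the masked rows are touched, so the other neighbourhood values are left exactly as they were.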


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.

Group by Postcode and join the Neighbourhood values with commas.


In [11]:
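# Join the distinct values of each group with ', '. The lambda is applied to Borough as well.
# set() removes duplicates but does not guarantee the order of the joined names.
df_table2 = df_table1.groupby('Postcode', as_index=False).agg(lambda x: ', '.join(set(x.astype(str))))
df_table2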

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Ionview, East Birchmount Park, Kennedy Park"
7,M1L,Scarborough,"Oakridge, Golden Mile, Clairlea"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"


'M5A' value is as expected!

In [13]:
df_table2.loc[df_table2['Postcode']=='M5A']

Unnamed: 0,Postcode,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"


'M7A' value is as expected!

In [14]:
df_table2.loc[df_table2['Postcode']=='M7A']

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park


The resultant table has 103 rows and 3 columns.

In [15]:
df_table2.shape

(103, 3)
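
The spot checks above can be collected into a few assertions that run in one go. This is a sketch only: `check_postcode_table` is a hypothetical helper, the rules it encodes are the cleaning rules described in this notebook, and `toy_result` is made-up sample data rather than the real 103-row table.

```python
import pandas as pd


def check_postcode_table(df):
    """Raise AssertionError if df breaks the cleaning rules used in this notebook."""
    assert list(df.columns) == ['Postcode', 'Borough', 'Neighbourhood']
    assert df['Postcode'].is_unique, 'each postcode should appear exactly once'
    assert (df['Borough'] != 'Not assigned').all(), 'unassigned borough found'
    assert (df['Neighbourhood'] != 'Not assigned').all(), 'unassigned neighbourhood found'


# Hypothetical cleaned sample in the same layout as df_table2.
toy_result = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront, Regent Park', "Queen's Park"],
})
check_postcode_table(toy_result)
```

Running `check_postcode_table(df_table2)` after the group-by cell would apply the same rules to the full table.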

Thank you! End of Part I of the Segmenting and Clustering Neighborhoods in Toronto assignment.