Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Fetch the wiki page with requests and parse its HTML with BeautifulSoup

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_html = requests.get(url).text
soup = BeautifulSoup(wiki_html, 'html.parser')

Convert the rows of the HTML table into a list of lists and create a dataframe

In [3]:
data = []
# Collect the stripped text of every <td> cell, one list per table row
for tr in soup.tbody.find_all('tr'):
    data.append([td.get_text().strip() for td in tr.find_all('td')])
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
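
The empty row 0 above most likely comes from the table's header row: it holds `th` cells rather than `td` cells, so the list comprehension produces an empty list for it. A minimal sketch on a made-up HTML fragment (`sample_html` is hypothetical and only mimics the layout of the wiki table):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not the actual Wikipedia markup
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods </td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Same extraction logic as in the cell above
rows = [
    [td.get_text().strip() for td in tr.find_all('td')]
    for tr in sample_soup.tbody.find_all('tr')
]

# The header row has no <td> cells, so its list is empty
print(rows[0])
print(rows[1])
```

Once loaded into a dataframe, such an empty list becomes a row of missing values, which is why the next step starts with `dropna()`.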


Drop rows with missing values ("None") and rows that contain "Not assigned" in any column

In [4]:
df = df.dropna()
na = 'Not assigned'
df = df[(df.PostalCode != na ) & (df.Borough != na) & (df.Neighborhood != na)]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
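
The three comparisons in the filter can also be written as a single row-wise mask. A small sketch on made-up rows (the `toy` dataframe and the names `mask` and `cleaned` are illustrative, not part of the notebook):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    [[None, None, None],
     ['M1A', 'Not assigned', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods'],
     ['M7A', "Queen's Park", 'Not assigned']],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

na = 'Not assigned'
toy = toy.dropna()

# True for rows in which no cell equals the placeholder string
mask = ~toy.eq(na).any(axis=1)
cleaned = toy[mask]
print(cleaned)
```

As with the original filter, a row is dropped as soon as any one of its columns is "Not assigned", so the hypothetical `M7A` row is removed even though its borough is known.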


Combine the neighborhoods that share a postal code by grouping on 'PostalCode' and 'Borough'. Join each group's neighborhoods into a single comma-separated string and convert the result back to a new dataframe

In [5]:
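def neighborhood_list(grouped):
    # Join the group's neighborhood names, sorted alphabetically, into one string
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))

df_group = df.groupby(['PostalCode', 'Borough'])
# One row per (PostalCode, Borough) pair; the joined string goes in 'Neighborhood'
df_grouped = df_group.apply(neighborhood_list).reset_index(name='Neighborhood')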
df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
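
An alternative way to get the same kind of result is to select the 'Neighborhood' column before aggregating, so the helper only ever sees the names. This is a sketch on made-up rows, not the code the notebook ran; `toy` and `combined` are hypothetical names:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    [['M5A', 'Downtown Toronto', 'Regent Park'],
     ['M5A', 'Downtown Toronto', 'Harbourfront'],
     ['M3A', 'North York', 'Parkwoods']],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Aggregate only the 'Neighborhood' column of each group into one string
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(lambda names: ', '.join(sorted(names)))
       .reset_index()
)
print(combined)
```

The two `M5A` rows collapse into one row whose 'Neighborhood' value lists both names in alphabetical order.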


Save the dataframe to a CSV file for the upcoming assignments

In [8]:
df_grouped.to_csv('Ass1.csv', index=False)
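
Passing `index=False` keeps the row index out of the file. The sketch below shows what happens on a round trip with and without it, using an in-memory string instead of a file; `toy` and the other names are illustrative assumptions:

```python
import io
import pandas as pd

# Made-up single-row dataframe for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1B'],
    'Borough': ['Scarborough'],
    'Neighborhood': ['Malvern, Rouge'],
})

# to_csv returns the CSV text when no path is given
text_with_index = toy.to_csv()
text_without_index = toy.to_csv(index=False)

cols_with = pd.read_csv(io.StringIO(text_with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(text_without_index)).columns.tolist()

# With the index saved, reading back adds an extra 'Unnamed: 0' column
print(cols_with)
print(cols_without)
```

Values that contain commas, such as the joined neighborhood strings, are quoted in the CSV text and read back as a single field.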

Print the shape of the new dataframe

In [10]:
df_grouped.shape

(102, 3)