# Segmenting and Clustering Neighborhoods in Toronto - Part 1

For this assignment, I will use the Beautiful Soup package to scrape the given Wikipedia page.

In [1]:
from bs4 import BeautifulSoup
print('Beautiful Soup Package Imported!')

Beautiful Soup Package Imported!


In [2]:
import requests
import pandas as pd

The first step is to scrape the relevant data from the Wikipedia page.

In [3]:
# request the Wikipedia page and keep the response body (raw HTML) as text
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_url = requests.get(url).text 
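
The request above assumes the page always loads. As a minimal sketch, assuming a hypothetical helper named `fetch_html` (not part of the original notebook), the same request can be given a timeout and an HTTP status check so that a failed download raises an error instead of silently returning an error page:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text; raise if the request fails.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage in the notebook would look like:
# website_url = fetch_html(url)
```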

In [4]:
# parse the doc using lxml's HTML parser
soup = BeautifulSoup(website_url, 'lxml')
# print(soup.prettify())  # uncomment to view the full HTML

In [5]:
# locate the table of interest
c_table = soup.find('table', {'class':'wikitable sortable'})
# c_table  # uncomment to view the contents

In [6]:
# extract every table row (<tr>) from the table
data = c_table.find_all('tr')
# data  # uncomment to view the contents
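
To see what the table lookup and row extraction return without downloading anything, here is a small sketch on a made-up HTML fragment. The `sample_html` string and its rows are assumptions for illustration, and Python's built-in `html.parser` is used in place of `lxml` so the snippet has no extra dependency. Note that the header row uses `<th>` cells, so collecting only `<td>` cells gives an empty list for it.

```python
from bs4 import BeautifulSoup

# made-up stand-in for the downloaded page (illustrative values only)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# locate the table by its class attribute, then collect its rows
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})
sample_rows = sample_table.find_all("tr")

# keep only the text of the <td> cells in each row
cells = [[td.text.strip() for td in tr.find_all("td")] for tr in sample_rows]
print(cells)
```

The empty first entry is why the header row has to be removed later when the dataframe is built.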

Now that we have scraped the relevant data, we need to store it more cleanly: row by row in a new list. We can then build the dataframe with the required columns.

In [11]:
# store the data row by row in a new list
table = []

# loop through each row and append the text of its <td> cells to 'table'
for row in data:
    table.append([t.text.strip() for t in row.find_all('td')])

# create the dataframe and keep only rows where a Borough is assigned
df = pd.DataFrame(table, columns=['PostalCode', 'Borough', 'Neighborhood'])
df = df[df['Borough'] != 'Not assigned'].copy()

# where Neighborhood is not assigned, use the Borough name instead
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

# drop the header row (it has no <td> cells, so it is all None) and reindex
df.drop(df.head(1).index, inplace=True)
df.reset_index(drop=True, inplace=True)
df

Finally, we will use the .shape attribute to find the number of rows in the dataframe.

In [8]:
df.shape

(103, 3)
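
The cleaning steps can also be checked on a tiny made-up sample. This is only a sketch: the rows in `sample_table` are illustrative assumptions, not the scraped data, and the empty header row is filtered out before the dataframe is built rather than dropped afterwards.

```python
import pandas as pd

# made-up rows in the same shape as the scraped list (illustrative only)
sample_table = [
    [],  # header row: it has no <td> cells
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]

# skip empty rows instead of dropping the first row later
rows = [r for r in sample_table if r]
sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# keep only rows with an assigned Borough
sample_df = sample_df[sample_df["Borough"] != "Not assigned"].reset_index(drop=True)

# fill unassigned neighborhoods with the Borough name, one column only
mask = sample_df["Neighborhood"] == "Not assigned"
sample_df.loc[mask, "Neighborhood"] = sample_df.loc[mask, "Borough"]

print(sample_df.shape)
```

Selecting the column explicitly with `.loc[mask, "Neighborhood"]` assigns to a single column, whereas assigning a whole column to a boolean-masked dataframe tries to overwrite every column of the matching rows.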