# Segmenting and Clustering Toronto Neighborhoods

## Scrape Wikipedia

In [16]:
import pandas as pd

wiki_link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

pandas provides `read_html`, which parses the HTML tables on a page and returns them as a list of DataFrames, one per table.

In [17]:
# Read the Wikipedia tables into a list of DataFrames
dfs = pd.read_html(wiki_link)
# The first DataFrame contains the borough data
df = dfs[0]
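
To make the behavior of `read_html` concrete, here is a minimal sketch that runs on an inline HTML snippet rather than the live Wikipedia page. The table contents and the use of the `match` argument to pick the table mentioning "Borough" are illustrative assumptions, not part of the original notebook. `read_html` also needs an HTML parser such as lxml to be installed.

```python
import io

import pandas as pd

# Hypothetical HTML with two tables; only the second mentions "Borough".
html = """
<table>
  <thead><tr><th>Key</th><th>Value</th></tr></thead>
  <tbody><tr><td>a</td><td>1</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr></thead>
  <tbody>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames; match keeps only the tables
# whose text matches the given string or regex.
tables = pd.read_html(io.StringIO(html), match="Borough")
toy = tables[0]
print(toy)
```

Selecting a table by `match` rather than by position may be somewhat more robust if the page layout changes, though either approach should be checked against the actual page.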

## Scrub the Data

With the data loaded, the next step is to clean it up.

In [45]:
# The DF will consist of three columns: PostalCode, Borough and Neighborhood
df = df.rename(columns={'Postal code':'PostalCode'})
df.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

We are tasked with removing any rows whose borough is 'Not assigned'. In this table, doing so also removes every row whose neighborhood is 'Not assigned'.

In [49]:
# Drop rows whose borough is 'Not assigned'
# This also removes the rows with an unassigned Neighborhood
df = df.drop(labels=df.loc[df.Borough == 'Not assigned'].index)
# Reset the index
df.reset_index(drop=True, inplace=True)
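
The same filtering step can be sketched on a small, made-up DataFrame using a boolean mask, followed by a check that no 'Not assigned' neighborhoods remain. The rows below are illustrative assumptions rather than the scraped data, and the names `toy`, `cleaned`, and `remaining_unassigned` are hypothetical.

```python
import pandas as pd

# Illustrative rows only; the real table comes from the Wikipedia page.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Count neighborhoods that are still unassigned after the filter.
remaining_unassigned = (cleaned["Neighborhood"] == "Not assigned").sum()
print(cleaned)
print(remaining_unassigned)
```

A check like `remaining_unassigned` is one way to confirm on the real data that filtering on the borough column is enough.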

Replace the characters ' / ' with ', ' to match the formatting of the provided example.

In [50]:
# Use commas instead of slashes for postal codes that cover multiple
# neighborhoods
df.Neighborhood = df.Neighborhood.apply(lambda x: x.replace(' / ', ', '))

# Check the result against the example from the prompt
df.loc[df.PostalCode == 'M5A']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
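
As a possible alternative to `apply` with a lambda, the replacement can be sketched with the vectorized `Series.str.replace`, which leaves missing values untouched instead of raising an error. The sample values below are assumptions for illustration, and `regex=False` is set so that ' / ' is treated as a literal string.

```python
import pandas as pd

# Illustrative values, including a missing entry to show NaN handling.
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods", None])

# Literal (non-regex) replacement of ' / ' with ', ' across the Series.
normalized = names.str.replace(" / ", ", ", regex=False)
print(normalized)
```

With the lambda form, a missing value would fail because `None` has no `replace` method, so the vectorized form may be the safer choice if the column can contain gaps.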


In [51]:
# Print the number of rows in our DataFrame
df.shape

(103, 3)