# Segmenting and Clustering Neighborhoods in Toronto

First, we import pandas, `urllib`, and BeautifulSoup for web scraping. We then fetch the Wikipedia page, locate the postal code table, and collect its cells.

In [1]:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

# Download the Wikipedia page and parse it with the lxml parser.
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
# Locate the postal code table and collect its cells (<td> elements).
table = soup.find('table', {'class': ''})
table_rows = table.find_all('td')

Next, we loop over the cells to collect every FSA (forward sortation area, the first three characters of a Canadian postal code). Each cell's text arrives as a single string in one column, so the remaining lines split it into separate columns and format them to the specifications in the assignment.

In [2]:
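data = []
for row in table_rows:
    # Each cell keeps its text in <p> tags; collect the stripped text.
    data.append([t.text.strip() for t in row.find_all('p')])

df = pd.DataFrame(data, columns=['Area'])
# The first three characters of each cell are the postal code (FSA).
df['PostalCode'] = df['Area'].str[:3]
# Drop unassigned postal codes (and any cells with no text at all).
df = df[~df['Area'].str.contains("Not assigned", na=True)].copy()
df['Area'] = df['Area'].str[3:]
# Split "Borough(Neighborhood)" on the parentheses; keep the first two pieces.
df[['Borough', 'Neighborhood']] = df['Area'].str.split(r'\(|\)', expand=True).iloc[:, [0, 1]]
df = df.drop(['Area'], axis=1)
# Replace the " /" separators with commas in every column.
df = df.stack().str.replace(' /', ',').unstack()
# Rows without a listed neighborhood reuse the borough name.
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
df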

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Queen's Park, Ontario Provincial Government | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East TorontoBusiness reply mail Processing Cen... | Enclave of M4L |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
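
To make the per-cell logic easier to follow, here is a minimal sketch of the same parsing steps in plain Python, without pandas or a network request. The helper name `parse_cell` and the sample strings are hypothetical. They are modeled on the output above and are not taken from the scraped page, and the real cells may contain variations this sketch does not handle.

```python
import re


def parse_cell(cell):
    """Split one cell string into (postal_code, borough, neighborhood).

    Hypothetical helper: returns None for unassigned postal codes.
    """
    if "Not assigned" in cell:
        return None
    # The first three characters are the postal code (FSA).
    postal_code, area = cell[:3], cell[3:]
    # Splitting on "(" or ")" separates the borough from the neighborhoods.
    parts = re.split(r"\(|\)", area)
    borough = parts[0].replace(" /", ",")
    # Without parentheses there is only one part, so reuse the borough.
    neighborhood = parts[1].replace(" /", ",") if len(parts) > 1 else borough
    return postal_code, borough, neighborhood


# Hypothetical cell strings for illustration only.
sample_cells = [
    "M1ANot assigned",
    "M3ANorth York(Parkwoods)",
    "M5ADowntown Toronto(Regent Park / Harbourfront)",
    "M9ZExample Borough",
]

rows = [parsed for parsed in map(parse_cell, sample_cells) if parsed is not None]
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the DataFrame: the unassigned code is skipped, the ` /` separators become commas, and a cell with no parenthesized neighborhood falls back to its borough name.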


Finally, we check the shape of the resulting DataFrame, which has 103 rows and 3 columns.

In [3]:
df.shape

(103, 3)
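
The shape confirms the row count but says nothing about the contents. As an optional extra check, the sketch below tests that every postal code looks like a Toronto FSA (the letter `M`, a digit, then an uppercase letter) and reports any code that appears more than once. The helper `check_postal_codes` is hypothetical, the sample list is copied from the rows shown in the output above, and treating duplicates as a problem is an assumption rather than a requirement stated in the assignment.

```python
import re

# Toronto FSAs: the letter M, one digit, one uppercase letter.
FSA_PATTERN = re.compile(r"M\d[A-Z]")


def check_postal_codes(codes):
    """Return (malformed, duplicates) for a list of postal code strings."""
    malformed = [code for code in codes if not FSA_PATTERN.fullmatch(code)]
    seen = set()
    duplicates = []
    for code in codes:
        if code in seen and code not in duplicates:
            duplicates.append(code)
        seen.add(code)
    return malformed, duplicates


# Postal codes visible in the output above.
sample_codes = ["M3A", "M4A", "M5A", "M6A", "M7A", "M8X", "M4Y", "M7Y", "M8Y"]
malformed, duplicates = check_postal_codes(sample_codes)
print("malformed:", malformed)
print("duplicates:", duplicates)
```

In the notebook, the same check could be run on the full column with `check_postal_codes(df['PostalCode'].tolist())`.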