# Segmenting and Clustering Neighborhoods in Toronto

First, we import `urllib` and BeautifulSoup for web scraping, along with pandas for the tabular work. We then set the URL, fetch the page, locate the postal code table, and collect its cells.

In [1]:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

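# Wikipedia page listing the postal codes that begin with "M" (Toronto)
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
# Locate the table that holds the postal codes
table = soup.find('table', {'class': ''})
# Despite the name, table_rows holds the individual <td> cells
table_rows = table.find_all('td')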

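To see why the later cleanup is needed, here is a minimal offline sketch of the same lookup pattern. The HTML string is hand-written and hypothetical: it only mimics the layout the scraping code appears to assume (one `<p>` per `<td>`), and it uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the assumed cell layout; not the real page.
html = """
<table>
  <tr>
    <td><p><b>M1A</b><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><span>North York(Parkwoods)</span></p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
cells = table.find_all("td")

# .text concatenates all nested strings, so the code and borough end up glued together
texts = [p.text.strip() for cell in cells for p in cell.find_all("p")]
print(texts)
```

Because the text of each cell is flattened into one string, the postal code, borough, and neighborhoods have to be separated afterwards.
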
Next, we loop over the cells to collect every FSA (Forward Sortation Area, the first three characters of a Canadian postal code). Each cell's text arrives as a single string in one column, so the remaining lines split it into separate columns and format them to the specifications in the assignment.

In [2]:
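data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('p')])

df = pd.DataFrame(data, columns=['Area'])
# The first three characters of each cell are the postal code (FSA)
df['PostalCode'] = df['Area'].str[:3]
# Drop unassigned codes; na=True also excludes cells with missing text
df = df[~df['Area'].str.contains("Not assigned", na=True)].copy()
df['Area'] = df['Area'].str[3:]
# The borough precedes the parentheses; the neighborhoods sit inside them
df[['Borough', 'Neighborhood']] = df['Area'].str.split(r'\(|\)', expand=True).iloc[:, [0, 1]]
df = df.drop(['Area'], axis=1)
# Neighborhoods are separated by " /" on the page; use commas instead
df = df.stack().str.replace(' /', ',', regex=False).unstack()
# Where no neighborhood is listed, fall back to the borough name
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
df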

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,"Queen's Park, Ontario Provincial Government","Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."

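As an illustration of the splitting logic, the sketch below parses a few sample strings with a single regular expression instead of slicing and `str.split`. It is an alternative, not the approach used in the cell above. The helper name `parse_cells`, the pattern, and the `M9Z` row are hypothetical; the other sample strings are assumed to match the shape of the scraped text.

```python
import pandas as pd

# Sample cell texts; "M9ZExample Borough" is a made-up row with no neighborhood.
cells = pd.Series([
    "M3ANorth York(Parkwoods)",
    "M5ADowntown Toronto(Regent Park / Harbourfront)",
    "M1ANot assigned",
    "M9ZExample Borough",
])

def parse_cells(cells: pd.Series) -> pd.DataFrame:
    """Split 'CODEBorough(Neighborhoods)' strings into three columns."""
    # 3-character code, then the borough, then an optional parenthesised list
    pattern = r"^(?P<PostalCode>.{3})(?P<Borough>[^(]*)(?:\((?P<Neighborhood>[^)]*)\))?"
    out = cells.str.extract(pattern)
    # Drop codes that have no borough assigned
    out = out[out["Borough"].str.strip() != "Not assigned"].copy()
    # Turn the " /" separators into commas
    out["Neighborhood"] = out["Neighborhood"].str.replace(" /", ",", regex=False)
    # Fall back to the borough when no neighborhood was listed
    out["Neighborhood"] = out["Neighborhood"].fillna(out["Borough"])
    return out

parsed = parse_cells(cells)
print(parsed)
```

Named groups give the columns their final names directly, which avoids the intermediate `Area` column, at the cost of a denser pattern.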

Finally, we check the shape of the DataFrame, which is returned as (rows, columns).

In [3]:
df.shape

(103, 3)

# Joining Latitude and Longitude Values

First, we read the geospatial CSV, which lists a latitude and longitude for each postal code.

In [4]:
gsp = pd.read_csv('../input/Geospatial_Coordinates.csv')
gsp

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Now we can merge the two DataFrames on their postal code columns, which adds latitude and longitude values to the previous DataFrame. Because the key columns have different names, both survive the join, so we drop the redundant `Postal Code` column.

In [5]:
df2 = df.merge(gsp, how='inner', left_on='PostalCode', right_on='Postal Code')
df2 = df2.drop(['Postal Code'], axis=1)
df2

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,"Queen's Park, Ontario Provincial Government","Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509

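An inner join silently drops any postal code that is missing from the coordinates file. The sketch below shows one way to check for that, using a left join with pandas' `indicator` and `validate` options. The two small frames are illustrative: the coordinates are copied from the output above, while the `M9Z` row and the names `left`, `right`, `checked`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Illustrative stand-in for df; "M9Z" is a made-up code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
# Illustrative stand-in for gsp
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = left.merge(
    right,
    how="left",                 # keep every row of the left frame
    left_on="PostalCode",
    right_on="Postal Code",
    indicator=True,             # adds a "_merge" column describing each row's origin
    validate="one_to_one",      # raises if a key is duplicated on either side
)

# Codes that found no match in the coordinates table
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join used above loses no rows.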