# Segmenting and Clustering Neighborhoods in Toronto
## Part I - Retrieve the Data
This part retrieves the list of Toronto neighborhoods from Wikipedia and formats it into a DataFrame.
### Get the Data from Wikipedia
First, I use the `urllib.request` library to open the Wikipedia URL and store the response in the variable `page`.

In [184]:
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
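
Because a live Wikipedia page can change over time, it may help to request one fixed revision. The sketch below is an illustration, not part of the original notebook: it builds (but does not send) a request for revision 960187814, the `wgRevisionId` that appears in the page source printed in this notebook. The helper name `build_request` and the `User-Agent` string are assumptions made for this example.

```python
import urllib.parse
import urllib.request

# Revision ID copied from the "wgRevisionId" field of the printed page source.
REVISION_ID = 960187814


def build_request(title, oldid, user_agent="toronto-notebook/0.1 (example)"):
    """Build, without sending, a request for one fixed revision of a Wikipedia page."""
    query = urllib.parse.urlencode({"title": title, "oldid": oldid})
    url = "https://en.wikipedia.org/w/index.php?" + query
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("List_of_postal_codes_of_Canada:_M", REVISION_ID)
print(request.full_url)

# Sending the request needs network access, so it is left commented out:
# with urllib.request.urlopen(request, timeout=30) as response:
#     page = response.read()
```

The resulting `page` bytes could be passed to BeautifulSoup in the same way as the response object used in this notebook.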

Then, with the help of the BeautifulSoup library, the page is parsed and stored in a new variable `soup`. The parsed HTML is then printed so that we can locate the part we are interested in.

In [185]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9cb85652-fb29-4eee-9a72-8381449a074e","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":960187814,"wgRevisionId":960187814,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toron

The target table is retrieved with BeautifulSoup's `find` function, which returns the first `table` element whose class is `wikitable sortable`.

In [186]:
table=soup.find('table', class_='wikitable sortable')
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

Each row in the table is delimited by **_tr_** tags, and inside each row the cells are delimited by **_td_** tags. <br>BeautifulSoup's `find_all` function is used in a loop to build three lists, one for each column of the table.

In [187]:
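A = []  # postal codes
B = []  # boroughs
C = []  # neighborhoods

# The header row holds only <th> cells, so it yields no <td> cells and is skipped.
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        # get_text(strip=True) returns the cell text without surrounding whitespace
        A.append(cells[0].get_text(strip=True))
        B.append(cells[1].get_text(strip=True))
        C.append(cells[2].get_text(strip=True))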
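
To make the row-by-row extraction easier to follow, here is a small self-contained sketch that runs the same idea on an inline HTML fragment shaped like the table printed above. It uses Python's built-in `html.parser` backend rather than `lxml`, and collects each row as one list instead of three parallel lists; both choices are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# A shortened fragment with the same structure as the Wikipedia table above.
html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

A list of rows like this can be passed straight to `pd.DataFrame(rows, columns=[...])`.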

Lists are then converted into the desired Dataframe:

In [188]:
import pandas as pd

df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighborhood']=C
df

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


### Format the Dataframe
Rows with no assigned borough are dropped, and a quick check confirms that none remain.

In [189]:
df.drop(df[df.Borough == 'Not assigned'].index, inplace = True)
'Not assigned' in df['Borough'].unique()

False

A neighborhood marked "Not assigned" should then be replaced by the name of its borough. In our case, no "Not assigned" neighborhoods remain, as the following check confirms.

In [190]:
'Not assigned' in df['Neighborhood'].unique()

False
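
The replacement step was not needed here, so the notebook contains no code for it. As a minimal sketch of how it could be done, the example below uses a small sample frame; the third row is made up for illustration and does not come from the Wikipedia table.

```python
import pandas as pd

# Sample data: the first two rows appear in the table above,
# the third row is hypothetical and only there to trigger the replacement.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "X0X"],
    "Borough": ["North York", "Downtown Toronto", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront", "Not assigned"],
})

# Boolean mask of the rows whose neighborhood is missing
missing = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]

print(sample)
```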

Display the final DataFrame with a fresh index:

In [191]:
df = df.reset_index(drop=True)  # reassign so the new index is actually kept
df

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [192]:
df.shape

(103, 3)

## Part II - Add Geographical Coordinates to the Locations
I was unable to use the geocoder package, so I loaded a CSV file of postal-code coordinates instead.

In [193]:
link = 'http://cocl.us/Geospatial_data'
df2 = pd.read_csv(link)
df2

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


Now let's merge the two tables to create the final Dataframe

In [194]:
# Left-join the coordinates onto the neighborhoods, matching rows by postal code
df_final = df.join(df2.set_index('Postal Code'), on='Postal Code')

In [195]:
df_final.reset_index(drop=True, inplace=True)
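
A join like this silently leaves `NaN` coordinates for any postal code missing from the CSV file. The sketch below shows one way to check for that with `DataFrame.merge`, using three rows whose values are taken from the tables in this notebook. The frame names `neighborhoods` and `coordinates` are assumptions made for the example.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],  # deliberately in a different order
    "Latitude": [43.654260, 43.753259, 43.725882],
    "Longitude": [-79.360636, -79.329656, -79.315572],
})

merged = neighborhoods.merge(
    coordinates,
    on="Postal Code",
    how="left",             # keep every neighborhood row
    validate="one_to_one",  # raise if a postal code is duplicated on either side
    indicator=True,         # add a "_merge" column describing each row's match
)

# Rows flagged "left_only" would have no coordinates
unmatched = merged[merged["_merge"] != "both"]
if not unmatched.empty:
    raise ValueError("Some postal codes have no coordinates")

merged = merged.drop(columns="_merge")
print(merged)
```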

## Part III - Visualization of the Locations and Clustering
First, let's create a map of Toronto with a marker for each neighborhood.

In [196]:
import folium

latitude = 43.71
longitude = -79.38

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    # parse_html is an option of folium.Popup, not of CircleMarker, so it is set only on the popup
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

Suppose a friend who lives in Toronto has ranked each location against multiple criteria. For the purposes of this exercise, that ranking is simulated with a random integer score from 0 to 19 (the upper bound passed to `np.random.randint` is exclusive).

In [197]:
import numpy as np
df_final['Score'] = np.random.randint(0,20, size=len(df_final))
df_final

|  | Postal Code | Borough | Neighborhood | Latitude | Longitude | Score |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 11 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 19 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 | 9 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 2 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 6 |
| ... | ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 | 8 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 | 11 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 | 4 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 | 3 |
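
Since the scores are random, every run of the notebook produces different values, and therefore different clusters later on. One possible way to make the run reproducible is a seeded NumPy generator, sketched below. The seed value 42 is arbitrary, and the size 103 matches the row count reported by `df.shape`.

```python
import numpy as np

# A seeded generator returns the same sequence on every run
rng = np.random.default_rng(seed=42)

# As with np.random.randint, "high" is exclusive: the values fall in 0..19
scores = rng.integers(low=0, high=20, size=103)

# With endpoint=True the upper bound becomes inclusive: the values fall in 0..20
scores_inclusive = rng.integers(low=0, high=20, size=103, endpoint=True)

print(scores[:5], scores.min(), scores.max())
```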


The ranking is shown on the map by setting the radius of each marker equal to its score.

In [198]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label, radius in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood'], df_final['Score']):
    label = folium.Popup(label, parse_html=True)
    # The marker radius (in pixels) is the score itself
    folium.CircleMarker(
        [lat, lng],
        radius=radius,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto
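
Using the raw score as the radius means that a score of 0 yields a radius of 0 pixels, which is hard to see on the map. One possible alternative is to map scores linearly onto a fixed pixel range. The helper `marker_radius` and its default bounds below are assumptions made for this sketch, not part of the original notebook.

```python
def marker_radius(score, min_radius=3.0, max_radius=15.0, max_score=19):
    """Map a score in [0, max_score] linearly onto [min_radius, max_radius] pixels."""
    clipped = min(max(score, 0), max_score)  # keep out-of-range scores inside the bounds
    return min_radius + (max_radius - min_radius) * clipped / max_score


radii = [marker_radius(s) for s in (0, 9.5, 19)]
print(radii)  # [3.0, 9.0, 15.0]
```

In the loop above, the result would be passed as `radius=marker_radius(radius)` when creating each `folium.CircleMarker`.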

From this, 10 clusters are established based on score and location. This analysis is deliberately shallow; in a real study, other data could be added, such as proximity to certain types of restaurants. <br>
The first step is to fit the clusters with KMeans.

In [199]:
from sklearn.cluster import KMeans

# Keep only the numeric features; the axis must be given by keyword in recent pandas versions
df_clustering = df_final.drop(columns=['Postal Code', 'Neighborhood', 'Borough'])

k = 10
kmeans = KMeans(n_clusters=k, random_state=0, max_iter = 5).fit(df_clustering)

df_final.insert(0, 'Cluster', kmeans.labels_)
df_final

|  | Cluster | Postal Code | Borough | Neighborhood | Latitude | Longitude | Score |
|---|---|---|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 11 |
| 1 | 8 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 19 |
| 2 | 6 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 | 9 |
| 3 | 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 2 |
| 4 | 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 98 | 6 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 | 8 |
| 99 | 2 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 | 11 |
| 100 | 7 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 | 4 |
| 101 | 3 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 | 3 |
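
KMeans relies on Euclidean distance, so a feature with a large numeric range can dominate the result. Here the scores range from 0 to 19, while the latitudes and longitudes differ by only fractions of a degree, so the clusters are probably driven mostly by the score. A common remedy is to standardize the features first. The sketch below illustrates this on synthetic data; the value ranges and the `n_init` setting are assumptions made for the example, not values from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 103  # same number of rows as the notebook's DataFrame

# Synthetic stand-ins for the three clustering features
features = np.column_stack([
    rng.uniform(43.60, 43.84, n),    # latitude-like values
    rng.uniform(-79.62, -79.16, n),  # longitude-like values
    rng.integers(0, 20, n),          # integer scores in 0..19
])

# Rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=10, random_state=0, n_init=10).fit(scaled)
labels = kmeans.labels_

print(np.bincount(labels))  # number of points in each cluster
```

Whether location and score should carry equal weight is a modeling choice; standardizing is only one reasonable default.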


Each location is then drawn on the map in the color of its cluster.

In [200]:
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# One distinct color per cluster, sampled evenly from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(c) for c in colors_array]

for lat, lon, poi, cluster in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood'], df_final['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # Cluster labels run from 0 to k-1, so they index the color list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

The centroid of each cluster is then extracted and plotted on the map, in order to find the best and the worst areas to live in Toronto according to the scores.

In [201]:
df_centers = pd.DataFrame(kmeans.cluster_centers_)
df_centers.columns = ['Latitude', 'Longitude', 'Score']
df_centers

|  | Latitude | Longitude | Score |
|---|---|---|---|
| 0 | 43.692133 | -79.387688 | 16.5 |
| 1 | 43.713103 | -79.406653 | 6.333333 |
| 2 | 43.697716 | -79.386854 | 11.166667 |
| 3 | 43.716897 | -79.450162 | 2.545455 |
| 4 | 43.691154 | -79.390965 | 14.4 |
| 5 | 43.741697 | -79.386959 | 0.375 |
| 6 | 43.678338 | -79.418937 | 8.461538 |
| 7 | 43.695041 | -79.366796 | 4.4 |
| 8 | 43.715843 | -79.409255 | 18.5 |
| 9 | 43.723603 | -79.334855 | 13.0 |
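
The best and worst clusters can also be read off programmatically rather than from the map. The sketch below rebuilds the centroid table shown above and picks the rows with the highest and lowest mean score. The variable names `best_cluster` and `worst_cluster` are assumptions made for the example.

```python
import pandas as pd

# Centroid values copied from the table above
df_centers = pd.DataFrame({
    "Latitude": [43.692133, 43.713103, 43.697716, 43.716897, 43.691154,
                 43.741697, 43.678338, 43.695041, 43.715843, 43.723603],
    "Longitude": [-79.387688, -79.406653, -79.386854, -79.450162, -79.390965,
                  -79.386959, -79.418937, -79.366796, -79.409255, -79.334855],
    "Score": [16.5, 6.333333, 11.166667, 2.545455, 14.4,
              0.375, 8.461538, 4.4, 18.5, 13.0],
})

best_cluster = df_centers["Score"].idxmax()   # index label of the highest mean score
worst_cluster = df_centers["Score"].idxmin()  # index label of the lowest mean score

print("Best cluster:", best_cluster, df_centers.loc[best_cluster].to_dict())
print("Worst cluster:", worst_cluster, df_centers.loc[worst_cluster].to_dict())
```

With these values, cluster 8 has the highest mean score (18.5) and cluster 5 the lowest (0.375).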


In [202]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, score in zip(df_centers['Latitude'], df_centers['Longitude'], df_centers['Score']):
    # The popup content should be a string, so the centroid's mean score is formatted as text
    label = folium.Popup('Mean score: {:.2f}'.format(score), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=score,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto