<div style="background: yellow; color: blue; font-size: 2rem;"><h1>Segmenting and Clustering Neighborhoods in Toronto</h1></div>

<span style="padding: 10px; background: white; color: red; font-size: 1.5rem;">by <b>Santanu Sikder</b></span>

<h1><u>Part-1:</u> Scraping the data (table) from the Wikipedia page and preparing the dataframe</h1>

In [1]:
# Install lxml for reading HTML tables using pandas' read_html method
!pip install lxml





In [2]:
# Import the necessary modules first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import folium
from sklearn.cluster import KMeans

I'll use pandas' **read_html** function to read the tables on the given page into a list of dataframes.

I'll then assign the first dataframe in the list, which is the table we want, to **dfMain**.

In [3]:
dfMain = pd.read_html("http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

# Check if the dataframe has been read in successfully from the table
dfMain.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [4]:
# Rename the column "Postal Code" to "PostalCode"
dfMain.rename(columns = {"Postal Code" : "PostalCode"}, inplace = True)
# Preview the dataframe
dfMain.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
# Export it to a local file as csv file
# dfMain.to_csv("canada_postal_codes.csv")

In [6]:
# Drop the rows where the Borough is "Not assigned"
# Filtering the dataframe with a boolean mask would give the same result, but I'll use the drop method
df = dfMain.drop(dfMain[dfMain["Borough"] == "Not assigned"].index, axis = 0).reset_index(drop = True)
# Preview the dataframe
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
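The comment in the cell above notes that a boolean mask would give the same result as `drop`. A minimal sketch of that equivalence follows, using three rows copied from the preview of dfMain; `dfDemo`, `viaDrop` and `viaMask` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Small stand-in for dfMain (three rows copied from the preview above)
dfDemo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Approach used in the notebook: drop the rows by their index labels
viaDrop = dfDemo.drop(dfDemo[dfDemo["Borough"] == "Not assigned"].index).reset_index(drop = True)

# Equivalent approach: keep only the rows where the mask is True
viaMask = dfDemo[dfDemo["Borough"] != "Not assigned"].reset_index(drop = True)

print(viaDrop.equals(viaMask))
```

Both versions leave the original dataframe untouched, so the choice between them is mostly a matter of readability.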


The scenarios described in the 3rd and 4th points of the 3rd step of the first part of the assignment instructions **do not occur** in this data: there is no need to combine the Neighbourhood names using commas (the Wikipedia page already does this), and no row has only its Neighbourhood marked "Not assigned" (as of August 2020).

Therefore, I'll skip these two steps and move on.
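Both conditions can also be checked in code rather than by inspection. The sketch below uses a small hypothetical dataframe (`dfCheck`, with made-up rows) to show the two checks and what the two steps might look like if the data ever needed them. It assumes the usual rule that an unassigned Neighbourhood takes the name of its Borough.

```python
import pandas as pd

# Hypothetical rows that DO need both steps (not the scraped data)
dfCheck = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Not assigned"],
})

# Check 1: does any postal code appear on more than one row?
hasDuplicates = dfCheck["PostalCode"].duplicated().any()

# Check 2: is any Neighbourhood still "Not assigned"?
unassigned = dfCheck["Neighbourhood"] == "Not assigned"
hasUnassigned = unassigned.any()

# Assumed rule: an unassigned Neighbourhood takes the Borough's name
dfCheck.loc[unassigned, "Neighbourhood"] = dfCheck.loc[unassigned, "Borough"]

# Combine Neighbourhoods that share a postal code, separated by commas
dfCombined = (
    dfCheck.groupby(["PostalCode", "Borough"], sort = False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(hasDuplicates, hasUnassigned)
print(dfCombined)
```

Running the same two checks on df should report False for both if the observation above holds.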

In [7]:
# Print out the shape of the dataframe
df.shape

(103, 3)

This is **THE END OF PART - 1**

<h1><u>Part-2:</u> Creating the dataframe containing the latitudes and longitudes of the postal codes</h1>

I'll use the CSV file whose link has been provided in the instructions for this assignment to create the required dataframe.

In [8]:
dfLatLong = pd.read_csv("geospatial_coordinates.csv").rename(columns = {"Postal Code" : "PostalCode"})
# Preview the dataframe
dfLatLong

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


I'll set the PostalCode column as the **index** in both df and dfLatLong. This lets pandas align the rows of the two dataframes by postal code in the next step.

In [9]:
df.set_index("PostalCode", inplace = True)
dfLatLong.set_index("PostalCode", inplace = True)
# Preview df
df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [10]:
# Preview dfLatLong
dfLatLong.head()

PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


I'll now insert the latitudes and longitudes from the above dataframe to our df.

In [11]:
df[["Latitude", "Longitude"]] = dfLatLong
# Preview the dataframe
df.head()

PostalCode,Borough,Neighbourhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
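The assignment above works because pandas aligns rows on the shared PostalCode index rather than on row order. The sketch below illustrates this with two small frames built from rows in the previews above (`dfLeft` and `dfRight` are illustrative names) and compares the result with `DataFrame.join`, which makes the alignment explicit.

```python
import pandas as pd

dfLeft = pd.DataFrame(
    {"Borough": ["North York", "North York"],
     "Neighbourhood": ["Parkwoods", "Victoria Village"]},
    index = pd.Index(["M3A", "M4A"], name = "PostalCode"),
)

# Deliberately listed in a different row order than dfLeft
dfRight = pd.DataFrame(
    {"Latitude": [43.725882, 43.753259],
     "Longitude": [-79.315572, -79.329656]},
    index = pd.Index(["M4A", "M3A"], name = "PostalCode"),
)

# Column assignment, as in the cell above: values are matched by index label
viaAssign = dfLeft.copy()
viaAssign[["Latitude", "Longitude"]] = dfRight

# Explicit alternative: a left join on the index
viaJoin = dfLeft.join(dfRight, how = "left")

print(viaAssign)
print(viaJoin)
```

A left join keeps every row of the left frame, so a postal code missing from the coordinates file would show up as NaN instead of being silently dropped.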


Finally, I'll reset the index to obtain the desired dataframe.

In [12]:
df.reset_index(inplace = True)
# Preview the dataframe
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


This is **THE END OF PART - 2**

<h1><u>Part-3:</u> Explore and cluster the neighborhoods, based on position/direction</h1>

First of all, I'll extract the rows from df containing "Toronto" in their Borough into a new dataframe **dfToronto**.

In [13]:
# Initialise a new dataframe
dfToronto = pd.DataFrame(df)

In [14]:
# Create a function which checks if the Borough name contains "Toronto"
def hasToronto(borough):
    return "Toronto" in borough

In [15]:
# Apply this function on the dataframe to obtain a mask/filter for extracting such rows
hasTorontoMask = np.array(df["Borough"].apply(hasToronto))
# Check the mask
hasTorontoMask

array([False, False,  True, False,  True, False, False, False, False,
        True, False, False, False, False, False,  True, False, False,
       False,  True,  True, False, False, False,  True,  True, False,
       False, False, False,  True,  True, False, False, False, False,
        True,  True, False, False, False,  True,  True,  True, False,
       False, False,  True,  True, False, False, False, False, False,
        True, False, False, False, False, False, False,  True,  True,
       False, False, False, False,  True,  True,  True, False, False,
       False,  True,  True,  True, False, False, False,  True,  True,
        True, False,  True,  True, False,  True,  True, False, False,
       False,  True,  True, False, False, False,  True,  True, False,
        True,  True, False, False])
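The same mask can also be built without a helper function by using the vectorised string accessor. A minimal sketch on a hypothetical Series of borough names (`boroughs` is an illustrative name):

```python
import pandas as pd

boroughs = pd.Series(["North York", "Downtown Toronto", "East Toronto", "Scarborough"])

# Element-wise Python check, equivalent to applying a helper function
maskApply = boroughs.apply(lambda name: "Toronto" in name)

# Vectorised alternative; regex = False treats "Toronto" as a plain substring
maskStr = boroughs.str.contains("Toronto", regex = False)

print(maskStr.tolist())  # [False, True, True, False]
```

Either mask can be passed straight to the dataframe's `[]` indexer, so converting it to a NumPy array first is optional.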

In [16]:
# Use the above mask to obtain the desired rows into dfToronto and reset the index
# (chaining avoids an in-place change on a slice of df)
dfToronto = df[hasTorontoMask].reset_index(drop = True)
# Preview the dataframe
dfToronto.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


I'll now generate the initial map of Toronto.

I've obtained the coordinates of Toronto from a Google search.

In [17]:
torontoCoords = (43.6532, -79.3832)
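If looking the coordinates up is not convenient, one possible alternative is to centre the map on the mean latitude and longitude of the postal codes themselves. The sketch below uses only the first three rows of the dfToronto preview (`dfDemo` and `centreCoords` are illustrative names); on the full dataframe the result would differ.

```python
import pandas as pd

# First three rows of dfToronto, copied from the preview above
dfDemo = pd.DataFrame({
    "Latitude": [43.65426, 43.662301, 43.657162],
    "Longitude": [-79.360636, -79.389494, -79.378937],
})

# Mean latitude and longitude as a (lat, long) tuple
centreCoords = (dfDemo["Latitude"].mean(), dfDemo["Longitude"].mean())
print(centreCoords)
```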

In [18]:
torontoMap = folium.Map(location = torontoCoords, zoom_start = 12)

# Now I'll add CircleMarkers for the different postal codes
for postalCode, lat, long in zip(dfToronto["PostalCode"], dfToronto["Latitude"], dfToronto["Longitude"]):
    folium.CircleMarker(
        (lat, long),
        radius = 7,
        opacity = 0.4,
        fill_opacity = 0.5,
        color = "blue",
        fill = True,
        fill_color = "red",
        popup = postalCode
    ).add_to(torontoMap)
    
print("FOLIUM Map-1\n")
# Show the map
torontoMap

FOLIUM Map-1



I'll now list the unique Boroughs in Toronto using groupby.

In [19]:
pd.DataFrame(dfToronto.groupby("Borough")["Borough"])

Unnamed: 0,0,1
0,Central Toronto,18 Central Toronto 19 Central Toronto 20...
1,Downtown Toronto,0 Downtown Toronto 1 Downtown Toronto ...
2,East Toronto,4 East Toronto 12 East Toronto 15 Ea...
3,West Toronto,9 West Toronto 11 West Toronto 14 We...
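Wrapping the GroupBy object in a dataframe works, but the output is hard to read. If only the borough names and their row counts are needed, `value_counts` is a more direct option. A sketch on a few hypothetical rows (`dfDemo` stands in for dfToronto):

```python
import pandas as pd

dfDemo = pd.DataFrame({
    "Borough": ["Downtown Toronto", "East Toronto", "Downtown Toronto", "West Toronto"],
})

# Number of rows per borough, sorted by borough name
boroughCounts = dfDemo["Borough"].value_counts().sort_index()
print(boroughCounts)

# Number of distinct boroughs
nBoroughs = dfDemo["Borough"].nunique()
print(nBoroughs)
```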


Since there are 4 Boroughs as depicted in the above dataframe, I'll try to divide the various postal codes into 4 clusters.

I'll use the KMeans class from sklearn.

I'll use the "k-means++" init value, which spreads out the initial centroids, and set n_init to 12 so that the algorithm runs 12 times from different starting centroids and keeps the run with the lowest inertia.

In [20]:
kmeans = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
# Fit a feature matrix with the latitudes and longitudes
kmeans.fit(dfToronto[["Latitude", "Longitude"]])

KMeans(n_clusters=4, n_init=12)
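Because k-means starts from randomly chosen centroids, the cluster numbering (and occasionally the clusters themselves) can change between runs. Passing `random_state` makes a run repeatable. The sketch below assumes scikit-learn is installed and uses synthetic points (`points` is made up, not the Toronto coordinates); the seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Four synthetic blobs of 10 points each around made-up centres
centres = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
points = np.vstack([c + rng.normal(scale = 0.3, size = (10, 2)) for c in centres])

# Two models with the same settings and the same seed
modelA = KMeans(init = "k-means++", n_clusters = 4, n_init = 12, random_state = 0).fit(points)
modelB = KMeans(init = "k-means++", n_clusters = 4, n_init = 12, random_state = 0).fit(points)

# Same seed and same data: the label arrays match
print(np.array_equal(modelA.labels_, modelB.labels_))

# Inertia is the sum of squared distances of the points to their closest centroid
print(modelA.inertia_)
```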

In [21]:
# Store the centroid coordinates and the labels into new variables
centroids, labels = kmeans.cluster_centers_, kmeans.labels_
# Print the above arrays
print(centroids, "\n\n", labels)

[[ 43.6547639  -79.38308287]
 [ 43.66943648 -79.32465436]
 [ 43.70563855 -79.39811351]
 [ 43.65506566 -79.44547176]] 

 [0 0 0 0 1 0 0 3 0 3 0 3 1 0 3 1 0 1 2 2 2 2 3 2 0 3 2 0 3 2 0 2 0 0 0 0 0
 0 1]
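To see how many postal codes fall into each cluster, the label array can be tallied with `np.bincount`. The sketch below copies the labels from the printed output above into a literal array, since the exact labels depend on the random initialisation and may differ on a rerun; `clusterSizes` is an illustrative name.

```python
import numpy as np

# Labels copied from the printed output above (39 postal codes)
labels = np.array([0, 0, 0, 0, 1, 0, 0, 3, 0, 3, 0, 3, 1, 0, 3, 1, 0, 1, 2, 2,
                   2, 2, 3, 2, 0, 3, 2, 0, 3, 2, 0, 2, 0, 0, 0, 0, 0, 0, 1])

# clusterSizes[i] is the number of postal codes assigned to cluster i
clusterSizes = np.bincount(labels, minlength = 4)
print(clusterSizes.tolist())  # [19, 5, 8, 7]
```

In this run, cluster 0, whose centroid lies close to the Toronto coordinates used for the map, holds 19 of the 39 postal codes.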


Now I'll use folium and the above arrays to show the centroids of the clusters, as well as all the points.

In [22]:
torontoClusteredMap = folium.Map(location = torontoCoords, zoom_start = 12)

# Set the array of colors to be used to colour the points according to the cluster they belong to
colors = ["red", "blue", "green", "orange"]

# Add circle markers for the centroids
for i, centroid in enumerate(centroids):
    folium.CircleMarker(
        centroid,
        radius = 100,
        color = "white",
        fill = True,
        fill_color = colors[i],
        fill_opacity = 0.6,
        popup = "Cluster-%d"%(i + 1)
    ).add_to(torontoClusteredMap)

# Now I'll add CircleMarkers for the different postal codes
for postalCode, lat, long, labelIndex in zip(dfToronto["PostalCode"], dfToronto["Latitude"], dfToronto["Longitude"], labels):
    folium.CircleMarker(
        (lat, long),
        radius = 10,
        fill_opacity = 0.9,
        color = "black",
        fill = True,
        fill_color = colors[labelIndex],
        popup = postalCode
    ).add_to(torontoClusteredMap)
    
print("FOLIUM Map-2\n")
# Show the map
torontoClusteredMap

FOLIUM Map-2



I have increased the size of the cluster centroid markers so that the clusters are more easily visible.

This is **THE END OF PART - 3**