# Segmenting and Clustering Neighbourhoods of Toronto

## Table of Contents


1. [Scraping the Wikipedia page](#first-section)
2. [Getting coordinates](#second-section)
3. [Exploring and clustering neighbourhoods](#third-section)

# Part 1: Scraping the Wikipedia page <a class="anchor" id="first-section"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

The code below relies on the following assumptions:

* The webpage is provided at the given URL
* The relevant table is the first table on that page
* The headings are "Postal Code", "Borough", and "Neighbourhood" in this order
* Cells indicating no borough or no neighbourhood contain the text "Not assigned"

First, we use the read_html function of pandas to extract the table.
After that, we weed out the rows without an assigned borough. Next, for entries without an assigned neighbourhood, the neighbourhood column is set to the value of the borough column.
Finally, we reset the index and drop the old one.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_toronto = pd.read_html(url)[0]
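# keep only rows with an assigned borough; copy so that later assignments do not act on a slice
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned'].copy()
# use the borough name where no neighbourhood is assigned
df_toronto.loc[df_toronto['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_toronto['Borough']
# renumber the rows and discard the old index
df_toronto.reset_index(drop=True, inplace=True)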
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
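
The effect of the three cleaning steps is easier to see on a handful of rows. The following is a minimal sketch on made-up data (the postal codes and names are placeholders, not taken from the Wikipedia table); it only assumes the column names and the "Not assigned" marker listed above.

```python
import pandas as pd

# Made-up rows in the same layout as the scraped table
raw = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A', 'X4A'],
    'Borough': ['Not assigned', 'Borough One', 'Borough Two', 'Borough Three'],
    'Neighbourhood': ['Not assigned', 'Alpha', 'Beta, Gamma', 'Not assigned'],
})

# 1. keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. fall back to the borough name where no neighbourhood is assigned
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# 3. renumber the rows from zero and discard the old index
clean = clean.reset_index(drop=True)
print(clean)
```

The first row disappears, and the last row ends up with its borough name in the neighbourhood column.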


In [3]:
df_toronto.shape

(103, 3)

So, there are 103 postal codes for Toronto. However, we will do our analysis based on neighbourhoods, and if you inspect the data, you will see that there are several postal codes covering exactly the same neighbourhoods. Therefore, we eliminate a few duplicates:

In [4]:
df_toronto.drop_duplicates(subset='Neighbourhood', inplace=True)
df_toronto.shape

(99, 3)

In the end, we have 99 different sets of neighbourhoods.
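
To inspect which postal codes share a neighbourhood entry before dropping them, something along the following lines might help. This is a sketch on made-up data (codes and names are placeholders); `duplicated` with `keep=False` marks every member of a duplicate group, whereas `drop_duplicates` keeps the first one by default.

```python
import pandas as pd

codes = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A', 'X4A'],
    'Neighbourhood': ['Alpha', 'Beta', 'Gamma', 'Beta'],
})

# all rows whose neighbourhood entry occurs more than once
shared = codes[codes.duplicated(subset='Neighbourhood', keep=False)]
print(shared)

# drop_duplicates keeps the first row of each group
deduplicated = codes.drop_duplicates(subset='Neighbourhood')
print(deduplicated.shape)
```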

# Part 2: Getting coordinates <a class="anchor" id="second-section"></a>

First, we install geocoder.

In [6]:
!pip install geocoder
import geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


For each postal code in the dataframe, we look up its coordinates.
Because geocoder with the Google API did not deliver any results, I switched to ArcGIS.
The results are first collected in two lists, one for latitudes and one for longitudes.
Both lists are then added to the dataframe as additional columns.

In [7]:
lats = []
longs = []
for postal_code in df_toronto['Postal Code']:
    lat_lng = None
    print("Postcode %s..." % (postal_code))
    # retry until the service returns coordinates (loops forever if it never does)
    while lat_lng is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng = g.latlng
    lats.append(lat_lng[0])
    longs.append(lat_lng[1])
    print("done.")
df_toronto['Latitude'] = lats
df_toronto['Longitude'] = longs

Postcode M3A...
done.
Postcode M4A...
done.
Postcode M5A...
done.
Postcode M6A...
done.
Postcode M7A...
done.
Postcode M9A...
done.
Postcode M1B...
done.
Postcode M3B...
done.
Postcode M4B...
done.
Postcode M5B...
done.
Postcode M6B...
done.
Postcode M9B...
done.
Postcode M1C...
done.
Postcode M4C...
done.
Postcode M5C...
done.
Postcode M6C...
done.
Postcode M9C...
done.
Postcode M1E...
done.
Postcode M4E...
done.
Postcode M5E...
done.
Postcode M6E...
done.
Postcode M1G...
done.
Postcode M4G...
done.
Postcode M5G...
done.
Postcode M6G...
done.
Postcode M1H...
done.
Postcode M2H...
done.
Postcode M3H...
done.
Postcode M4H...
done.
Postcode M5H...
done.
Postcode M6H...
done.
Postcode M1J...
done.
Postcode M2J...
done.
Postcode M3J...
done.
Postcode M4J...
done.
Postcode M5J...
done.
Postcode M6J...
done.
Postcode M1K...
done.
Postcode M2K...
done.
Postcode M3K...
done.
Postcode M4K...
done.
Postcode M5K...
done.
Postcode M6K...
done.
Postcode M1L...
done.
Postcode M2L...
done.
Postcode M
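
The lookup loop above retries until the service answers. If the service never answers for some postal code, the cell would not terminate, so a bounded variant may be preferable. The following is only a sketch: `geocode_with_retries` and its parameters are hypothetical names, and a stand-in function replaces the real service so that the example runs offline.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    lookup is any callable that returns a [lat, lng] pair or None.
    """
    for _ in range(max_attempts):
        lat_lng = lookup(query)
        if lat_lng is not None:
            return lat_lng
        time.sleep(pause_seconds)  # wait a little before asking again
    return None  # the caller decides how to handle a missing result

# Stand-in for a geocoding service that fails twice before answering
answers = iter([None, None, [43.75245, -79.32991]])

def fake_lookup(query):
    return next(answers)

result = geocode_with_retries(fake_lookup, 'M3A, Toronto, Ontario')
print(result)
```

With the real service, the lookup could be passed as `lambda q: geocoder.arcgis(q).latlng`, which is the call used in the cell above.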

In [8]:
df_toronto.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.66263,-79.52831
6,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662
7,M3B,North York,Don Mills,43.74923,-79.36186
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70718,-79.31192
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804


Let's take a quick look at the number of neighbourhoods per borough.

In [9]:
df_toronto['Borough'].value_counts()

North York          20
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
York                 5
East York            5
Mississauga          1
Name: Borough, dtype: int64

We now install geopy and folium for visualization.

In [10]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes 
from geopy.geocoders import Nominatim 
import folium 

Let's create an overview map showing the neighbourhoods of Toronto.

In [11]:
geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode("Toronto, Ontario")
# avoid the name "map", which would shadow the Python builtin
toronto_map = folium.Map(location=[location.latitude, location.longitude], zoom_start=11)

# one marker per row, labelled with the neighbourhood name(s)
for lat, lng, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighbourhood']):
    label = folium.Popup(neighbourhood, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)
toronto_map

# Part 3: Exploring and clustering neighbourhoods <a class="anchor" id="third-section"></a>

The main idea of this section is to explore and cluster the neighbourhoods based on the frequency of different types of venues. We use the Foursquare API to retrieve the venue data.

In [12]:
# @hidden_cell 
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
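
An alternative to keeping credentials in a notebook cell is to read them from environment variables. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and uses a plain dict in place of the real environment for the demonstration.

```python
import os

def load_credentials(environ=os.environ):
    """Return (client_id, client_secret), or raise a clear error if one is missing."""
    try:
        # hypothetical variable names, chosen for this sketch only
        return environ['FOURSQUARE_CLIENT_ID'], environ['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Environment variable {} is not set'.format(missing)) from None

# Demonstration with a dict standing in for the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id)
```

Calling `load_credentials()` without an argument reads from the actual process environment.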

We define a function that explores the surroundings of a list of locations given by name and coordinates and returns a dataframe describing the nearby venues. It is essentially copied from an earlier lab.

In [13]:
import requests

VERSION = '20210222' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.extend([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(venues_list)
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
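
The function above indexes straight into the response, so a response without groups or a venue without categories would raise an exception. A more defensive extraction step could look like the sketch below. The helper name `extract_venues` is hypothetical, the response layout is inferred only from the keys the function above accesses, and the sample response is made up.

```python
def extract_venues(response_json):
    """Return (name, lat, lng, category) tuples from an explore-style response."""
    groups = response_json.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up response with one categorised and one uncategorised venue
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Some Park', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Place', 'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({}))
```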

For all the neighbourhoods, we now retrieve the venues within 500 m of the coordinates stored for each neighbourhood.

In [14]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighbourhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
B

Let's see how many venues we found and what the dataframe looks like.

In [15]:
print(toronto_venues.shape)
toronto_venues.head()

(2327, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.75245,-79.32991,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.75245,-79.32991,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.75245,-79.32991,TTC stop #8380,43.752672,-79.326351,Bus Stop
3,Victoria Village,43.73057,-79.31306,Wigmore Park,43.731023,-79.310771,Park
4,Victoria Village,43.73057,-79.31306,Memories of Africa,43.726602,-79.312427,Grocery Store


Next, we transform the venue category column into a one-hot encoding. We also add the neighbourhood column back and make it the first column.
To do so, we look up its index i (where it was inserted), put it first, and append all columns with index < i, followed by all columns with index > i.

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
n_ind = toronto_onehot.columns.get_loc("Neighborhood")
if n_ind != 0:
    fixed_columns = [toronto_onehot.columns[n_ind]] + \
        list(toronto_onehot.columns[:n_ind]) + \
        list(toronto_onehot.columns[n_ind + 1:])
    toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
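
On a toy example, the one-hot step looks as follows. The sketch uses `DataFrame.insert` to place the neighbourhood column first, which is an alternative to the index juggling above, not what the notebook does. Note that `insert` raises a ValueError if a column of that name already exists, which would happen if a venue category were itself called Neighborhood; the index lookup above copes with that case. All data here is made up.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Bakery', 'Park'],
})

# one column per category, one row per venue
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='')

# put the neighbourhood name in front of the category columns
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(onehot)
```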


This preview is not very informative as all visible entries are zero. The same holds for the next output, in which we take a peek at the relative frequencies of venue categories per neighbourhood.

In [17]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
94,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
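
Taking the mean of one-hot columns per group yields relative frequencies, because each venue contributes a 1 to exactly one category. A small sketch on made-up data illustrates this; each row of the result sums to 1.

```python
import pandas as pd

onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Bakery': [1, 0, 0, 0],
    'Park': [0, 1, 1, 1],
})

# mean of 0/1 indicators = share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
print(freq.sum(axis=1))
```

Neighbourhood A has three venues, one bakery and two parks, so its frequencies are one third and two thirds.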


Again, we only see zeroes. Let's define a function that returns the top x venue categories for a row (read: neighbourhood) of a dataframe like the one above:

In [18]:
def top_x(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

print("Most frequent in  %s: %s." % (toronto_grouped.iloc[1,0], top_x(toronto_grouped.iloc[1,], 5)))
print("Most frequent in  %s: %s." % (toronto_grouped.iloc[45,0], top_x(toronto_grouped.iloc[45,], 5)))

Most frequent in  Alderwood, Long Branch: ['Convenience Store' 'Gym' 'Performing Arts Venue' 'Pub'
 'Ethiopian Restaurant'].
Most frequent in  Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens: ['Pizza Place' 'Bus Line' 'Arts & Crafts Store' 'Fast Food Restaurant'
 'Farmers Market'].
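
One property of top_x is worth keeping in mind: sorting keeps categories with frequency zero in the ranking, so a neighbourhood with fewer than x distinct categories is padded with categories that do not occur there at all. The following sketch shows a variant that skips such categories. The name `top_x_nonzero` and the data are made up for illustration.

```python
import pandas as pd

def top_x_nonzero(row, num_top_venues):
    """Like top_x, but ignore categories that never occur in the neighbourhood."""
    frequencies = row.iloc[1:].astype(float)  # skip the neighbourhood name
    present = frequencies[frequencies > 0]
    return list(present.sort_values(ascending=False).index[:num_top_venues])

row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.25, 'Park': 0.75, 'Pub': 0.0, 'Zoo': 0.0})
print(top_x_nonzero(row, 3))
```

Here only two categories are returned although three were requested. Code that expects exactly x values per neighbourhood would need to pad the result, for example with None.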


It seems there are not just zeroes in the data! The neighbourhoods differ in their venue category profiles.

We now create a dataframe that lists the top-ten venue categories per neighbourhood.

In [27]:
num_top_venues = 10

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('cat_{}_venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = top_x(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue,cat_6_venue,cat_7_venue,cat_8_venue,cat_9_venue,cat_10_venue
0,Agincourt,Chinese Restaurant,Bubble Tea Shop,Hong Kong Restaurant,Supermarket,Bakery,Badminton Court,Discount Store,Department Store,Shopping Mall,Grocery Store
1,"Alderwood, Long Branch",Convenience Store,Gym,Performing Arts Venue,Pub,Ethiopian Restaurant,Dry Cleaner,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room
2,"Bathurst Manor, Wilson Heights, Downsview North",Home Service,Men's Store,Business Service,Yoga Studio,Event Service,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Falafel Restaurant
3,Bayview Village,Trail,Construction & Landscaping,Park,Dog Run,Dry Cleaner,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Sandwich Place,Butcher,Juice Bar,Indian Restaurant,Pub,Thai Restaurant,Sports Club,Café


Let's prepare the actual clustering. After some experimentation, I decided to go for eight clusters for all of Toronto. We drop the neighbourhood column as it is not relevant for the clustering and run k-means with a fixed random state for reproducible results.

In [28]:
# set number of clusters
kclusters = 8
toronto_grouped_venues_only = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=45).fit(toronto_grouped_venues_only)

In [29]:
print(toronto_grouped_venues_only.shape)

(98, 267)
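
The notebook does not say how the number of clusters was explored. One common way to compare candidate values of k is the silhouette score; the sketch below shows that technique on made-up, well-separated toy data. It is not the procedure used for the Toronto data, and the setting `n_init=10` is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three tight, well-separated groups of four points each
offsets = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
centres = np.array([[0, 0], [10, 10], [20, 0]], dtype=float)
X = np.vstack([centre + offsets for centre in centres])

# silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=45).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On sparse frequency data such as the venue profiles, the scores are typically far less clear-cut than on this toy example, so they are best read as one hint among several.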


We originally extracted 99 neighbourhoods from Wikipedia, so is the clustering acting weird by only clustering 98 of them?

No, this is because Foursquare did not return any venues for one neighbourhood. You might remember that we stored venue data in neighborhoods_venues_sorted, so have a look at its shape:

In [30]:
print("Neighbourhoods with venues: ", neighborhoods_venues_sorted.shape[0])
print("Total number of neighbourhoods: ", df_toronto.shape[0])

Neighbourhoods with venues:  98
Total number of neighbourhoods:  99
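
To find out which neighbourhood is missing, the names in the original dataframe can be compared with the names that occur in the venue data. A sketch on made-up data:

```python
import pandas as pd

all_neighbourhoods = pd.Series(['A', 'B', 'C'], name='Neighbourhood')
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'C'],
    'Venue': ['v1', 'v2', 'v3'],
})

# neighbourhoods that never appear in the venue data
without_venues = all_neighbourhoods[~all_neighbourhoods.isin(venues['Neighborhood'])]
print(list(without_venues))
```

Applied to the notebook's data, the two inputs would be the neighbourhood column of df_toronto and the Neighborhood column of toronto_venues.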


For further analysis, we create a dataframe combining data about location, prevalent venue categories, and clusters. To clean up the data a bit, we also do the following:
* Finally get rid of the American/British English difference in the column name 'Neighbo(u)rhood' :)
* For each neighbourhood, add the label of the cluster it belongs to (according to k-means).
* Remove unclustered neighbourhoods (see explanation above).
* Change the type of the cluster label to integer, as the missing label left by the join turns the column into floats.

In [31]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# rename on a copy so that df_toronto itself stays untouched
df_toronto_merged = df_toronto.rename(columns={'Neighbourhood': 'Neighborhood'})
df_toronto_merged = df_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# drop neighbourhoods without a cluster label and store the label as integer
df_toronto_merged = df_toronto_merged[df_toronto_merged['Cluster Labels'].notnull()]
df_toronto_merged = df_toronto_merged.astype({'Cluster Labels': 'int'})
df_toronto_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue,cat_6_venue,cat_7_venue,cat_8_venue,cat_9_venue,cat_10_venue
0,M3A,North York,Parkwoods,43.75245,-79.32991,1,Food & Drink Shop,Park,Bus Stop,Yoga Studio,Event Service,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Falafel Restaurant
1,M4A,North York,Victoria Village,43.73057,-79.31306,1,German Restaurant,Grocery Store,Park,Yoga Studio,Ethiopian Restaurant,Dry Cleaner,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,4,Coffee Shop,Restaurant,Breakfast Spot,Yoga Studio,Thai Restaurant,Health Food Store,Italian Restaurant,Food Truck,Event Space,Electronics Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042,4,Clothing Store,Men's Store,American Restaurant,Women's Store,Furniture / Home Store,Restaurant,Bookstore,Food Court,Cosmetics Shop,Toy / Game Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,4,Coffee Shop,Sandwich Place,Bank,Falafel Restaurant,Fried Chicken Joint,Gastropub,Theater,Café,Mediterranean Restaurant,Burrito Place
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113,4,Lounge,Pool,Yoga Studio,Event Service,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Falafel Restaurant
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133,4,Coffee Shop,Japanese Restaurant,Restaurant,Sushi Restaurant,Gay Bar,Café,Pub,Pizza Place,Dance Studio,Men's Store
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.64869,-79.38544,4,Coffee Shop,Hotel,Café,Restaurant,Asian Restaurant,Salad Place,Theater,Steakhouse,Gym,Japanese Restaurant
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945,4,Flower Shop,Coffee Shop,Fast Food Restaurant,Sushi Restaurant,Italian Restaurant,Chinese Restaurant,Bank,Park,Dance Studio,Dumpling Restaurant


We now define two things for cluster visualization on the map of Toronto:

1. A colour map for colouring neighbourhood markers according to the cluster the neighbourhood belongs to.
2. A function that takes as input a dataframe like the one we just created, a map centre, and a zoom level.

The function creates a map showing the neighbourhoods as markers coloured by cluster.

In [32]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# set color scheme for the clusters
colourmap = cm.gist_ncar
colours = [colourmap(i) for i in np.linspace(0, 0.9, kclusters)] 

# add markers to the map
def plot_clusters(df, start_loc, zoom=11):
    # avoid the name "map", which would shadow the Python builtin
    cluster_map = folium.Map(location=start_loc, zoom_start=zoom)
    for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighborhood'], df['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=colors.rgb2hex(colours[cluster]),
            fill=True,
            fill_color=colors.rgb2hex(colours[cluster]),
            fill_opacity=0.7).add_to(cluster_map)
    return cluster_map
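
To see which colours the markers actually receive, the colour list can be converted to hex strings up front. This sketch repeats the colour map setup from the cell above with the same eight clusters; it only prints the values and does not draw a map.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 8

# sample the colour map at evenly spaced positions and convert to '#rrggbb'
positions = np.linspace(0, 0.9, kclusters)
hex_colours = [colors.rgb2hex(cm.gist_ncar(p)) for p in positions]

for cluster, colour in enumerate(hex_colours):
    print(cluster, colour)
```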

Let's put this function to use and have a look at what the clusters look like on the map!

In [33]:
plot_clusters(df_toronto_merged, start_loc=[location.latitude, location.longitude])

It looks like one cluster is very dominant, especially downtown, but it is also strong in all other areas of the city. The other clusters seem to be much smaller. The first visual impression also suggests that most clusters are scattered over the city area, so generalizing statements based on geographic location do not seem possible (no "The east of Toronto is like this, the west like that" statements). Let's see how big the clusters are in comparison:

In [34]:
df_toronto_merged.groupby(['Cluster Labels']).size()

Cluster Labels
0     2
1    20
2    10
3     3
4    59
5     1
6     1
7     2
dtype: int64
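
Expressed as shares, the imbalance becomes even more obvious. The sketch below simply re-enters the cluster sizes from the output above and divides by their total.

```python
import pandas as pd

# cluster sizes copied from the output above
sizes = pd.Series({0: 2, 1: 20, 2: 10, 3: 3, 4: 59, 5: 1, 6: 1, 7: 2}, name='size')

# share of all clustered neighbourhoods per cluster
shares = (sizes / sizes.sum()).round(3)
print(shares.sort_values(ascending=False))
```

Cluster 4 contains 59 of the 98 clustered neighbourhoods, which is about 60 %.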

In the following, we discuss the four largest clusters in descending order of their size. We consider clusters with one or two neighbourhoods too small for reliable conclusions.

## Cluster 4: Fancy a coffee and something to eat?

In this cluster, various kinds of restaurants, coffee shops, and cafés are very strong. Across the neighbourhoods, such venues are very often among the top-5 venue categories.
While the cluster spreads out into all areas of the city, it is particularly prevalent around downtown Toronto. Have a look at the map and table below.

In [35]:
columns = ['Neighborhood', 'Cluster Labels', 'cat_1_venue', 'cat_2_venue', 'cat_3_venue', 'cat_4_venue', 'cat_5_venue']
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 4], start_loc=[location.latitude, location.longitude])

In [39]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 4, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
2,"Regent Park, Harbourfront",4,Coffee Shop,Restaurant,Breakfast Spot,Yoga Studio,Thai Restaurant
3,"Lawrence Manor, Lawrence Heights",4,Clothing Store,Men's Store,American Restaurant,Women's Store,Furniture / Home Store
4,"Queen's Park, Ontario Provincial Government",4,Coffee Shop,Sandwich Place,Bank,Falafel Restaurant,Fried Chicken Joint
9,"Garden District, Ryerson",4,Coffee Shop,Clothing Store,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant
10,Glencairn,4,Pizza Place,Grocery Store,Italian Restaurant,Pub,Mediterranean Restaurant
12,"Rouge Hill, Port Union, Highland Creek",4,Construction & Landscaping,Bar,Event Service,Dumpling Restaurant,Eastern European Restaurant
14,Woodbine Heights,4,Grocery Store,Pharmacy,Café,Bus Line,Gas Station
15,St. James Town,4,Coffee Shop,Café,Italian Restaurant,Gastropub,Cosmetics Shop
19,The Beaches,4,Asian Restaurant,Trail,Health Food Store,Pub,Dry Cleaner
20,Berczy Park,4,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Farmers Market


## Cluster 1: The Green Cluster

The striking aspect of this cluster is that places to enjoy the city outdoors rank far up in these neighbourhoods: almost all of the neighbourhoods listed below have parks, fields, trails or the like among their top-5 venue categories.
So these might be places for the outdoor folks. Other than that, the cluster still shows a large variety of gastronomy.

In [40]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 1], start_loc=[location.latitude, location.longitude])

In [41]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 1, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
0,Parkwoods,1,Food & Drink Shop,Park,Bus Stop,Yoga Studio,Event Service
1,Victoria Village,1,German Restaurant,Grocery Store,Park,Yoga Studio,Ethiopian Restaurant
7,Don Mills,1,Gas Station,Soccer Field,Burger Joint,Park,Ethiopian Restaurant
16,Humewood-Cedarvale,1,Field,Hockey Arena,Grocery Store,Park,Trail
17,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",1,Fish & Chips Shop,Grocery Store,Electronics Store,Shopping Mall,College Rec Center
18,"Guildwood, Morningside, West Hill",1,Construction & Landscaping,Gym / Fitness Center,Park,Event Service,Dry Cleaner
22,Woburn,1,Coffee Shop,Korean BBQ Restaurant,Park,Business Service,Yoga Studio
32,Scarborough Village,1,Spa,Grocery Store,Park,Restaurant,Indian Restaurant
35,"East Toronto, Broadview North (Old East York)",1,Bus Stop,Intersection,Park,Yoga Studio,Dumpling Restaurant
39,Bayview Village,1,Trail,Construction & Landscaping,Park,Dog Run,Dry Cleaner


## Cluster 2: The Fast Food Centre?

It's striking that fast food is very dominant in this cluster. Fast food venues and pizza places are the top venue category in seven out of ten neighbourhoods here (along with ice cream shops in one and pharmacies in two neighbourhoods).

In [42]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 2], start_loc=[location.latitude, location.longitude])

In [43]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 2, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
5,"Islington Avenue, Humber Valley Village",2,Pharmacy,Bank,Home Service,Café,Shopping Mall
8,"Parkview Hill, Woodbine Gardens",2,Pizza Place,Pet Store,Athletics & Sports,Breakfast Spot,Flea Market
11,"West Deane Park, Princess Gardens, Martin Grov...",2,Pizza Place,Chinese Restaurant,Sandwich Place,Tea Room,Costume Shop
51,"Cliffside, Cliffcrest, Scarborough Village West",2,Ice Cream Shop,Discount Store,Coffee Shop,Sandwich Place,Restaurant
64,Weston,2,Pizza Place,Pharmacy,Diner,Fried Chicken Joint,Escape Room
70,Westmount,2,Pizza Place,Coffee Shop,Chinese Restaurant,Sandwich Place,Middle Eastern Restaurant
77,"Kingsview Village, St. Phillips, Martin Grove ...",2,Pizza Place,Bus Line,Arts & Crafts Store,Fast Food Restaurant,Farmers Market
82,"Clarks Corners, Tam O'Shanter, Sullivan",2,Fast Food Restaurant,Pharmacy,Golf Course,Bus Stop,Sandwich Place
85,"Milliken, Agincourt North, Steeles East, L'Amo...",2,Pharmacy,Intersection,Yoga Studio,Ethiopian Restaurant,Dry Cleaner
90,"Steeles West, L'Amoreaux West",2,Fast Food Restaurant,Gym Pool,Pharmacy,Burger Joint,Pizza Place


## Cluster 3: The Three Sisters

These three neighbourhoods are very similar: all three have yoga studios, parks, and dry cleaners in their top-5, and event services and dumpling restaurants each appear in the top-5 of two of them.

In [44]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 3], start_loc=[location.latitude, location.longitude])

In [45]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 3, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
27,Hillcrest Village,3,Park,Residential Building (Apartment / Condo),Yoga Studio,Ethiopian Restaurant,Dry Cleaner
45,"York Mills, Silver Hills",3,Park,Yoga Studio,Event Service,Dry Cleaner,Dumpling Restaurant
68,"Forest Hill North & West, Forest Hill Road Park",3,Event Service,Park,Yoga Studio,Dry Cleaner,Dumpling Restaurant


## Remaining clusters (without deeper discussion)

### Cluster 0

In [47]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 0], start_loc=[location.latitude, location.longitude])

In [48]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 0, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
28,"Bathurst Manor, Wilson Heights, Downsview North",0,Home Service,Men's Store,Business Service,Yoga Studio,Event Service
62,Roselawn,0,Home Service,Yoga Studio,Event Space,Dumpling Restaurant,Eastern European Restaurant


### Cluster 5

In [49]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 5], start_loc=[location.latitude, location.longitude])

In [50]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 5, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
6,"Malvern, Rouge",5,Fast Food Restaurant,Yoga Studio,Donut Shop,Fish & Chips Shop,Field


### Cluster 6

In [51]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 6], start_loc=[location.latitude, location.longitude])

In [52]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 6, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
63,"Runnymede, The Junction North",6,Furniture / Home Store,Brewery,Seafood Restaurant,Ethiopian Restaurant,Dumpling Restaurant


### Cluster 7

In [53]:
plot_clusters(df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 7], start_loc=[location.latitude, location.longitude])

In [54]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 7, columns]

Unnamed: 0,Neighborhood,Cluster Labels,cat_1_venue,cat_2_venue,cat_3_venue,cat_4_venue,cat_5_venue
50,Humber Summit,7,Rental Car Location,Auto Garage,Yoga Studio,Dog Run,Dry Cleaner
71,"Wexford, Maryvale",7,Auto Garage,Yoga Studio,Event Service,Dumpling Restaurant,Eastern European Restaurant
