# Segmenting and Clustering Neighborhoods in Toronto 
### Author: Rishabh Dingliwal
<b> This notebook contains the solutions for the Coursera IBM Applied Data Science Capstone Week 3 final assignment </b>

## Part 1

<b> Importing the required libraries </b>

In [1]:
import pandas as pd
import requests

<b> Connecting to the Wiki webpage and requesting data </b>

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_url = requests.get(url)
wiki_url

<Response [200]>

<b> Extracting the tables using pandas </b>

In [3]:
wiki_data = pd.read_html(wiki_url.text)
wiki_data

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

In [4]:
len(wiki_data), type(wiki_data)

(3, list)

<b> The extracted data is a list of 3 elements, each of which is a table. We only need the first table, so we keep it and discard the other two </b>

In [5]:
wiki_data = wiki_data[0]
wiki_data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


<b> Dropping the rows which have 'Borough' as 'Not assigned' </b>

In [6]:
df = wiki_data[wiki_data['Borough'] != 'Not assigned']
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


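<b> Grouping the dataframe by 'Postal Code', so that neighbourhoods with the same postal code would be combined into a single row. The result of groupby is not assigned here, so df is left unchanged; in this table each postal code already appears to occupy a single row </b>

In [7]:
# groupby() returns a new GroupBy object and does not modify df in place;
# the result is discarded, so the dataframe displayed below is the same as before
df.groupby(by = 'Postal Code')
df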

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
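
For reference, a grouping that actually merges rows has to aggregate the groups and assign the result. The sketch below is one possible way to do it, shown on a small made-up table (the `toy` rows are hypothetical, with two rows sharing a postal code); it is not the step that produced the output above.

```python
import pandas as pd

# Hypothetical toy rows: 'M5A' appears twice so that there is something to combine
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Join the neighbourhood names of each postal code into one comma-separated string;
# sort=False keeps the postal codes in order of first appearance
combined = (
    toy.groupby(['Postal Code', 'Borough'], as_index=False, sort=False)['Neighbourhood']
       .agg(', '.join)
)
print(combined)
```

The assignment to `combined` is the important part: without it the aggregated table is computed and then thrown away.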


<b> Counting the number of rows having 'Neighbourhood' as 'Not assigned' </b> 

In [8]:
df.Neighbourhood.str.count('Not assigned').sum()

0
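
Note that `str.count` counts pattern occurrences inside each string, which is not quite the same as counting rows that are equal to the placeholder. A minimal sketch of the difference, using made-up values (the `neighbourhoods` series is hypothetical):

```python
import pandas as pd

# Hypothetical values: the last one merely contains the phrase
neighbourhoods = pd.Series(['Parkwoods', 'Not assigned', 'Not assigned area'])

# Occurrences of the pattern anywhere inside the strings
pattern_hits = neighbourhoods.str.count('Not assigned').sum()
# Rows whose value is exactly the placeholder
exact_rows = (neighbourhoods == 'Not assigned').sum()

print(pattern_hits, exact_rows)
```

For this dataframe both approaches give 0, so the conclusion is the same either way.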

<b> Zero rows have the value 'Not assigned' in the 'Neighbourhood' column, so there is no need to drop any more rows </b> <br>
<b> Resetting the index of the dataframe </b>

In [9]:
df = df.reset_index(drop = True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


<b> Printing out the shape of the dataframe </b>

In [10]:
df.shape

(103, 3)

<b> We can see that the dataframe has 103 rows and 3 columns </b>

## Part 2

<b> Instead of the geocoder library I will be using the given CSV file, as suggested in the assignment </b> <br>
<b> Loading the CSV file into a pandas dataframe </b>

In [11]:
latlong_data = pd.read_csv('https://cocl.us/Geospatial_data')
latlong_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


<b> Checking the shapes of our two dataframes, to check compatibility for merging </b>

In [12]:
print('Shape of our Wiki data: ', df.shape)
print('Shape of our Latitude-Longitude data: ', latlong_data.shape)

Shape of our Wiki data:  (103, 3)
Shape of our Latitude-Longitude data:  (103, 3)
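
Equal shapes only show that the row counts match; they do not show that every postal code has a partner in the other table. One way to check this during the merge itself is sketched below with three rows copied from the tables in this notebook (the frame names `wiki` and `coords` are made up for the example).

```python
import pandas as pd

# Three rows taken from the tables in this notebook; the second frame is in a different order on purpose
wiki = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'Postal Code': ['M5A', 'M3A', 'M4A'],
    'Latitude': [43.654260, 43.753259, 43.725882],
    'Longitude': [-79.360636, -79.329656, -79.315572],
})

# validate raises a MergeError if a postal code is duplicated on either side;
# indicator adds a '_merge' column recording which frame(s) each row came from
checked = pd.merge(wiki, coords, on='Postal Code', how='outer',
                   validate='one_to_one', indicator=True)
unmatched = checked[checked['_merge'] != 'both']
print(len(unmatched), 'postal codes without a partner')
```

An outer join is used so that unmatched codes from either side stay visible instead of being dropped silently.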


<b> Joining/merging the two dataframes </b>

In [13]:
df = pd.merge(df, latlong_data, on = 'Postal Code')
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


<b> Printing out the shape of the combined dataframe </b>

In [14]:
df.shape

(103, 5)

## Part 3

<b> Taking inspiration from the previous labs, where we analysed clusters of neighbourhoods in New York, we cluster the Toronto neighbourhoods by the similarity of their venue categories, using the Foursquare API and the K-Means algorithm </b> 

<b> Downloading and Importing the required libraries </b>

In [15]:
!pip install folium

from geopy.geocoders import Nominatim
import folium



<b> Getting coordinates of Toronto </b>

In [16]:
address = 'Toronto, Ontario'

geolocater = Nominatim(user_agent = 'toronto_explorer')
location = geolocater.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The coordinates of Toronto are', latitude, longitude)

The coordinates of Toronto are 43.6534817 -79.3839347


<b> Visualizing the map of Toronto with the neighbourhoods marked </b>

In [17]:
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 11)
# Use lat/lng as loop variables so that the Toronto coordinates stored in latitude/longitude are not overwritten
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label_string = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label_string, parse_html = True)
    folium.CircleMarker([lat, lng], radius = 5, popup = label, color = 'red', fill = True).add_to(map_toronto)
    
map_toronto

<b> Defining the Foursquare API credentials</b>

In [18]:
CLIENT_ID = 'XFNQJ2BXRCL4SSOWPTZXQ5UAMCYLZXOMRD3DMDVMQUWVQGBP'
CLIENT_SECRET = 'JI2XUJEWV0OECP1Q21RYSHIHUSGGJQQEC1CZBRPGZ3B22BUH'
VERSION = '20180605'

print('My Client ID:', CLIENT_ID)
print('My Client Secret:', CLIENT_SECRET)

My Client ID: XFNQJ2BXRCL4SSOWPTZXQ5UAMCYLZXOMRD3DMDVMQUWVQGBP
My Client Secret: JI2XUJEWV0OECP1Q21RYSHIHUSGGJQQEC1CZBRPGZ3B22BUH


<b> Creating a function to get venue categories in Toronto </b>

In [19]:
def getNearbyVenues(names, latitude, longitude, radius = 500):
    """Return one row per venue found within `radius` of each neighbourhood's coordinates."""
    venues_list = []
    
    for name, lat, long in zip(names, latitude, longitude):
        print(name)
        
        # Build the explore request for this neighbourhood
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, long, radius)
        # Keep only the items of the first group in the response
        results = requests.get(url).json()['response']['groups'][0]['items']
        # One tuple per venue: neighbourhood name and coordinates, venue name, first category
        venues_list.append([(name, lat, long, v['venue']['name'], v['venue']['categories'][0]['name']) for v in results])
    
    # Flatten the per-neighbourhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue', 'Venue Category']
    
    return nearby_venues
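
The list comprehension in the function assumes that the response always contains a group and that every venue has at least one category. A more defensive parse could look like the sketch below; `parse_venues` is a hypothetical helper, and the payload in the usage lines is hand-written to mimic the nesting the function relies on (`response` → `groups` → `items` → `venue`), not a real API response.

```python
def parse_venues(payload, name, lat, lng):
    """Flatten one explore-style payload into rows, skipping venues without a category."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            # No category to cluster on, so leave this venue out
            continue
        rows.append((name, lat, lng, venue.get('name'), categories[0]['name']))
    return rows

# Hand-written payload: one complete venue and one without categories
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park', 'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot', 'categories': []}},
]}]}}

rows = parse_venues(payload, 'Parkwoods', 43.753259, -79.329656)
empty = parse_venues({}, 'Parkwoods', 43.753259, -79.329656)
print(rows, empty)
```

Returning an empty list for a malformed payload keeps the loop over neighbourhoods running instead of stopping at the first `KeyError` or `IndexError`.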

<b> Finding the venues in Toronto for each neighbourhood </b>

In [20]:
venues_in_toronto = getNearbyVenues(df['Neighbourhood'], df['Latitude'], df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

<b> Finding total number of venues </b>

In [21]:
venues_in_toronto.shape

(1314, 5)

<b> Printing the first 5 rows of the venues dataframe </b>

In [22]:
venues_in_toronto.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop


<b> Showing the first 5 venues of each neighbourhood </b>

In [23]:
venues_in_toronto.groupby('Neighbourhood').head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop
...,...,...,...,...,...
1299,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,Wings Joint
1300,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,South St. Burger,Burger Joint
1301,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Dollarama,Discount Store
1302,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Healthy Planet,Supplement Shop


<b> Grouping the venues by 'Venue Category' to list the distinct categories </b>

In [24]:
venues_in_toronto.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Ardene Shoes Outlet
Airport,Downsview,43.737473,-79.394420,Toronto Downsview Airport (YZD)
Airport Food Court,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Billy Bishop Café
Airport Gate,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Gate 8
Airport Lounge,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Porter Lounge
...,...,...,...,...
Warehouse Store,Thorncliffe Park,43.705369,-79.349372,Costco
Wine Bar,"Toronto Dominion Centre, Design Exchange",43.653206,-79.379817,The National Club
Wings Joint,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium
Women's Store,Caledonia-Fairbanks,43.689026,-79.453512,Maximum Woman


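<b> There are 239 different venue categories. Note that groupby('Venue Category').max() returns the largest value of each column within a category, not a venue count; 'Accessories Store' is listed first only because the categories are sorted alphabetically </b>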
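
To count venues per category, `value_counts` is the more direct tool. A minimal sketch on made-up rows in the same layout as `venues_in_toronto` (the `venues` frame and the 'Second Cup' row are hypothetical):

```python
import pandas as pd

# Hypothetical venue rows; 'Coffee Shop' appears twice
venues = pd.DataFrame({
    'Venue': ['Brookbanks Park', 'Tim Hortons', 'Second Cup', 'Portugril'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Portuguese Restaurant'],
})

category_counts = venues['Venue Category'].value_counts()  # venues per category, largest first
n_categories = venues['Venue Category'].nunique()          # number of distinct categories
most_common = category_counts.idxmax()                     # category with the most venues

print(n_categories, most_common)
```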

<b> One Hot Encoding the categorical features </b>

In [25]:
toronto_venues_categorical = pd.get_dummies(venues_in_toronto[['Venue Category']], prefix = '', prefix_sep = '')
toronto_venues_categorical

Unnamed: 0,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1311,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1312,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<b> Adding neighbourhood to the encoded dataframe </b>

In [26]:
toronto_venues_categorical['Neighbourhood'] = venues_in_toronto['Neighbourhood']

new_columns = [toronto_venues_categorical.columns[-1]] + list(toronto_venues_categorical.columns[:-1])
toronto_venues_categorical = toronto_venues_categorical[new_columns]

toronto_venues_categorical

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1310,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1311,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1312,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<b> Grouping the dataframe by 'Neighbourhood' and calculating the mean of each one-hot column, i.e. the fraction of a neighbourhood's venues that fall into each category </b>

In [27]:
toronto_grouped = toronto_venues_categorical.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
94,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
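
Why the mean of one-hot columns is a frequency can be seen on a few rows. The sketch below uses the five venue rows shown by `head()` earlier, treated here as if they were the complete data (which is an assumption for the sake of the example); `dtype=int` is passed so that the encoded columns are numeric regardless of the pandas version.

```python
import pandas as pd

# The five rows shown by venues_in_toronto.head(), used as a stand-alone toy table
venues = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Parkwoods',
                      'Victoria Village', 'Victoria Village', 'Victoria Village'],
    'Venue Category': ['Park', 'Food & Drink Shop',
                       'Hockey Arena', 'Portuguese Restaurant', 'Coffee Shop'],
})

# One column per category, 1 where the venue belongs to it
one_hot = pd.get_dummies(venues['Venue Category'], dtype=int)
one_hot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Averaging the 0/1 columns per neighbourhood gives the share of venues in each category
grouped = one_hot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so neighbourhoods with many venues and neighbourhoods with few are put on the same scale before clustering.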


<b> Creating a function to get the most common venue categories of a neighbourhood </b>

In [28]:
def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
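
This function sorts every category, including those with a frequency of zero. When a neighbourhood has fewer distinct categories than `num_top_venues`, the remaining slots are therefore filled with categories that do not occur there at all. A possible variant that leaves those slots empty is sketched below; `most_common_nonzero` is a hypothetical name and the frequencies in the example row are made up.

```python
import pandas as pd

def most_common_nonzero(row, num_top_venues):
    # Hypothetical variant of most_common_venues: ignore categories with zero frequency
    freqs = pd.to_numeric(row.iloc[1:])
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    top = list(freqs.index[:num_top_venues])
    # Pad with None so that every neighbourhood still yields num_top_venues entries
    return top + [None] * (num_top_venues - len(top))

# Made-up row: only two categories actually occur in this neighbourhood
row = pd.Series({'Neighbourhood': 'Parkwoods', 'Park': 0.6, 'Food & Drink Shop': 0.4,
                 'Yoga Studio': 0.0, 'Dog Run': 0.0})
top_three = most_common_nonzero(row, 3)
print(top_three)
```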

<b> Importing required libraries </b>

In [29]:
import numpy as np

<b> There are too many venue categories to read at a glance, so we take only the top 10 to summarise each neighbourhood; the clustering itself uses the full frequency table </b>

In [30]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues): # Equivalent to for ind in range(0, num_top_venues)
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError: # indicators only covers 1st to 3rd; every later rank takes 'th'
        columns.append('{}th Most Common Venue'.format(ind + 1))

neighbourhoods_venues_sorted = pd.DataFrame(columns = columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Latin American Restaurant,Chinese Restaurant,Breakfast Spot,Curling Ice,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Gym,Pub,Coffee Shop,Gastropub,Convenience Store,Diner,Dim Sum Restaurant,Dessert Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Shopping Mall,Pharmacy,Ice Cream Shop,Supermarket,Restaurant,Middle Eastern Restaurant,Sandwich Place,Deli / Bodega
3,Bayview Village,Chinese Restaurant,Bank,Japanese Restaurant,Café,Yoga Studio,Curling Ice,Donut Shop,Dog Run,Distribution Center,Discount Store
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Thai Restaurant,Comfort Food Restaurant,Pet Store,Pharmacy,Pizza Place,Café,Pub
...,...,...,...,...,...,...,...,...,...,...,...
92,"Willowdale, Willowdale West",Pharmacy,Grocery Store,Butcher,Pizza Place,Coffee Shop,College Gym,Cuban Restaurant,Dog Run,Distribution Center,Discount Store
93,Woburn,Coffee Shop,Korean BBQ Restaurant,Cuban Restaurant,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
94,Woodbine Heights,Curling Ice,Bus Stop,Park,Beer Store,Skating Rink,Intersection,Athletics & Sports,Dog Run,Distribution Center,Discount Store
95,York Mills West,Park,Convenience Store,Yoga Studio,Cuban Restaurant,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
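
The try/except approach to the column names works for the 10 columns used here, but it would label a 21st column as '21th'. If more columns were ever needed, a small helper along the following lines could build the names instead (`ordinal` is a hypothetical function, not part of the original notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column names as in the cell above, built without exception handling
columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '...', columns[10])
```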


<b> Importing the required library for K-Means </b>

In [31]:
from sklearn.cluster import KMeans

<b> Running the K-Means algorithm on the dataframe </b>

In [32]:
k = 5
toronto_grouped_clustering = toronto_grouped.drop(columns = 'Neighbourhood')

kmeans = KMeans(n_clusters = k, random_state = 0)
kmeans.fit(toronto_grouped_clustering)
kmeans

KMeans(n_clusters=5, random_state=0)
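
The value k = 5 is chosen by hand here. One common way to compare candidate values is the silhouette score; the sketch below shows the idea on synthetic data (three artificial 2-D blobs standing in for `toronto_grouped_clustering`), so the resulting number says nothing about the Toronto data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature table: three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for several cluster counts and record the mean silhouette score of each
scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores[n_clusters] = silhouette_score(X, labels)

# The cluster count with the highest score separates the points best by this measure
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real frequency table the scores may be much closer together, in which case the choice of k remains a judgment call.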

<b> Checking the labels of the model </b>

In [33]:
kmeans.labels_

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       3, 2, 2, 2, 2, 2, 0, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2,
       3, 2, 3, 2, 2, 2, 2, 3, 4], dtype=int32)
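
A quick way to read this array is to count how many neighbourhoods landed in each cluster. The sketch below copies the labels printed above and tallies them.

```python
import numpy as np
import pandas as pd

# Labels copied from the kmeans.labels_ output above
labels = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                   2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
                   3, 2, 2, 2, 2, 2, 0, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3,
                   2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2,
                   3, 2, 3, 2, 2, 2, 2, 3, 4])

# Number of neighbourhoods per cluster label, ordered by label
sizes = pd.Series(labels).value_counts().sort_index()
share_largest = sizes.max() / sizes.sum()

print(sizes)
print('Share of the largest cluster: {:.0%}'.format(share_largest))
```

Cluster 2 is by far the largest, so most neighbourhoods are not separated from one another by this clustering.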

<b> Adding clustering label to the dataframe </b>

In [34]:
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

<b> Joining neighbourhoods_venues_sorted and df on neighbourhood to prepare for plotting </b>

In [35]:
toronto_merged = df.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on = 'Neighbourhood')
toronto_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3.0,Park,Food & Drink Shop,Yoga Studio,Creperie,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Coffee Shop,Intersection,Hockey Arena,Portuguese Restaurant,Yoga Studio,Department Store,Curling Ice,Dance Studio,Deli / Bodega
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,2.0,Coffee Shop,Park,Bakery,Theater,Breakfast Spot,Performing Arts Venue,Chocolate Shop,Pub,Café,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,2.0,Clothing Store,Accessories Store,Furniture / Home Store,Boutique,Event Space,Coffee Shop,Vietnamese Restaurant,Gay Bar,Cosmetics Shop,Discount Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2.0,Coffee Shop,Sushi Restaurant,Discount Store,Bank,Italian Restaurant,Beer Bar,Japanese Restaurant,Nightclub,Smoothie Shop,Distribution Center
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,2.0,Pool,River,Yoga Studio,Coworking Space,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,2.0,Sushi Restaurant,Coffee Shop,Diner,Beer Bar,Hobby Shop,Creperie,Ethiopian Restaurant,Pub,Ramen Restaurant,Indian Restaurant
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,2.0,Light Rail Station,Yoga Studio,Comic Shop,Skate Park,Brewery,Burrito Place,Farmers Market,Fast Food Restaurant,Restaurant,Smoke Shop
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,2.0,Baseball Field,Construction & Landscaping,Yoga Studio,Cuban Restaurant,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant


<b> Dropping all rows without a cluster label (NaN), i.e. neighbourhoods that have no row in the clustered data, so that they are not plotted </b>

In [36]:
toronto_merged_nonan = toronto_merged.dropna(subset = ['Cluster Labels'])
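
The cluster labels show up as 3.0, 0.0, 2.0 in the merged table because a column that contains NaN cannot keep an integer dtype. A minimal sketch of the effect and of converting back after the drop, using made-up frames (`places` and `labelled` are hypothetical, and 'Upper Rouge' is given no label on purpose):

```python
import pandas as pd

# Hypothetical toy frames: 'Upper Rouge' has no row in the label table
places = pd.DataFrame({'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Upper Rouge']})
labelled = pd.DataFrame({'Neighbourhood': ['Parkwoods', 'Victoria Village'],
                         'Cluster Labels': [3, 0]})

# The unmatched row gets NaN, which turns the whole column into floats
merged = places.join(labelled.set_index('Neighbourhood'), on='Neighbourhood')

# After dropping the unlabeled rows the column can be cast back to integers
clean = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(clean)
```

With integer labels the later `int(cluster)` conversions in the plotting code become unnecessary, although they do no harm.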

<b> Importing required libraries for plotting the clusters on map </b>

In [37]:
import matplotlib.cm as cm
import matplotlib.colors as colors

<b> Plotting the clusters on map </b>

In [38]:
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 11)

# One colour per cluster, evenly spaced over the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, long, poi, cluster in zip(toronto_merged_nonan['Latitude'], toronto_merged_nonan['Longitude'], 
                                   toronto_merged_nonan['Neighbourhood'], toronto_merged_nonan['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi), parse_html = True)
    # Index the palette directly with the cluster label (0 to k - 1)
    folium.CircleMarker([lat, long], radius = 5, popup = label, color = rainbow[int(cluster)], 
                        fill = True, fill_color = rainbow[int(cluster)]).add_to(map_clusters)

map_clusters

<b> We have successfully clustered Toronto neighbourhoods based on the venue categories </b>