# Segmenting and Clustering Neighborhoods in Toronto

## Getting and cleaning the data

### 1) Scraping the table from Wikipedia

There is no need to use Beautiful Soup to import the table: pandas has a useful `read_html` function, which returns a list of all the tables on a page, already converted into DataFrames.

In [6]:
import pandas as pd

In [7]:
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [8]:
#resetting to default the number of rows displayed on output
pd.reset_option("display.max_rows")
dfs = pd.read_html(url)
dfs

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

The table is indexed as the first data frame in the list.

In [9]:
toronto_pc = dfs[0]
toronto_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Postal Code    180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


We only need the rows that have an assigned borough, so we drop every row whose borough is 'Not assigned'.

In [10]:
toronto_pc= toronto_pc[toronto_pc.Borough != 'Not assigned']
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [11]:
#Checking that every row contains a different Postal Code
toronto_pc['Postal Code'].nunique() == len(toronto_pc)

True

In [12]:
#cleaning the index
# drop=True prevents the old index values from being added as a new column
toronto_pc.reset_index(inplace = True, drop = True) 
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### 2) Fetching Coordinates

Since the geocoder package seems to have many issues, I will use the pgeocode package to look up the coordinates. After setting the country code ('ca'), this library answers a query for a single postal code with a pandas Series (a list of codes yields a DataFrame). From this result we keep only the latitude and longitude.

In [13]:
import pgeocode 

nomi = pgeocode.Nominatim('ca')

Latitude = []
Longitude = []

for pc in toronto_pc['Postal Code']:
    query = nomi.query_postal_code(pc)
    Latitude.append(query.latitude)
    Longitude.append(query.longitude)

In [14]:
#assigning the coordinates to new columns in the existing dataframe
toronto_pc['Latitude'] = Latitude
toronto_pc['Longitude'] = Longitude
toronto_pc

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939


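We got pandas' common `SettingWithCopyWarning`, raised because `toronto_pc` was created by filtering another DataFrame. It is harmless here and can be ignored: the coordinates appear to be placed in the correct rows.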
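The warning can also be avoided rather than ignored. The following is a minimal sketch on a toy frame (the rows and the name `assigned` are invented for illustration), assuming the only change is an explicit `.copy()` after the filter:

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# .copy() makes the filtered result an independent DataFrame, so adding
# columns to it later does not trigger SettingWithCopyWarning
assigned = raw[raw["Borough"] != "Not assigned"].copy()
assigned = assigned.reset_index(drop=True)

# adding a new column now modifies only `assigned`, never `raw`
assigned["Latitude"] = [43.7545, 43.7276]
print(assigned)
```

The original frame `raw` is left untouched, which is usually what is intended after a filter.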

In [15]:
toronto_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Postal Code    103 non-null    object 
 1   Borough        103 non-null    object 
 2   Neighbourhood  103 non-null    object 
 3   Latitude       102 non-null    float64
 4   Longitude      102 non-null    float64
dtypes: float64(2), object(3)
memory usage: 4.1+ KB


We got just one NaN:

In [16]:
toronto_pc[toronto_pc.isnull().any(axis = 1)]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,,


We can retrieve the coordinates manually from Google Maps:

In [17]:
ll = 43.63657950496381, -79.61576357177279

In [18]:
#set the values
toronto_pc.at[76, 'Latitude'] = ll[0]
toronto_pc.at[76,'Longitude'] = ll[1]

In [19]:
toronto_pc.isnull().any(axis = 0)

Postal Code      False
Borough          False
Neighbourhood    False
Latitude         False
Longitude        False
dtype: bool

In [20]:
#checking if the values are correctly set
toronto_pc.iloc[76]

Postal Code                                        M7R
Borough                                    Mississauga
Neighbourhood    Canada Post Gateway Processing Centre
Latitude                                      43.63658
Longitude                                   -79.615764
Name: 76, dtype: object
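Setting the values by the hard-coded row label 76 works here, but it depends on the current index. A possible alternative, sketched below on a two-row toy frame, selects the row by its postal code instead. The dictionary name `manual_coords` is hypothetical; the coordinates are the ones retrieved above.

```python
import numpy as np
import pandas as pd

# Two-row toy frame with one missing coordinate pair
df = pd.DataFrame({
    "Postal Code": ["M5A", "M7R"],
    "Latitude": [43.6555, np.nan],
    "Longitude": [-79.3626, np.nan],
})

# Manually retrieved coordinates, keyed by postal code
manual_coords = {"M7R": (43.63657950496381, -79.61576357177279)}

for code, (lat, lon) in manual_coords.items():
    mask = df["Postal Code"] == code          # boolean row selector
    df.loc[mask, ["Latitude", "Longitude"]] = [lat, lon]

print(df)
```

Because the row is located by value, the same code keeps working if the index is reset or the rows are reordered.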

In [21]:
toronto_pc

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939


Let's examine how the neighbourhoods are distributed by borough; this will be useful later when deciding how many clusters we need:

In [42]:
print(toronto_pc.groupby(['Borough'])['Borough'].count())
print()
print(toronto_pc.groupby(['Borough'])['Borough'].count().describe())

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Scarborough         17
West Toronto         6
York                 5
Name: Borough, dtype: int64

count    10.000000
mean     10.300000
std       7.469196
min       1.000000
25%       5.000000
50%       7.500000
75%      15.750000
max      24.000000
Name: Borough, dtype: float64
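The same per-borough counts can be obtained a little more directly with `value_counts`. This is only a sketch on an invented four-row Series, not the Toronto data:

```python
import pandas as pd

# Invented sample of borough labels
boroughs = pd.Series(
    ["North York", "North York", "Etobicoke", "Mississauga"], name="Borough"
)

counts = boroughs.value_counts()   # rows per borough, largest first
print(counts)
print()
print(counts.describe())           # summary statistics of those counts
```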


## Clustering

Importing the libraries needed for visualization and clustering:

In [23]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library
#library needed to get Toronto coordinates:
!pip install geopy
from geopy.geocoders import Nominatim
print('Libraries imported.')

Libraries imported.


In order to define an instance of the geocoder, we need to define a user_agent:

In [24]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto with neighborhoods superimposed:


In [25]:
toronto_map= folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_pc['Latitude'], toronto_pc['Longitude'], toronto_pc['Borough'], toronto_pc['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

We are now ready to set up the Foursquare API to make queries.

### 1) Defining Foursquare API credentials and parameters

Since I find building the URL by hand, as suggested in the course, overly complicated, I decided to follow the approach shown in the Foursquare documentation (https://developer.foursquare.com/docs/places-api/getting-started/). Instead of filling the URL directly with the parameters, we create a dictionary of parameters and pass it to a GET request, which is much neater.

In [26]:
import json, requests
url = 'https://api.foursquare.com/v2/venues/explore'

params = dict(
client_id='YOUR_CLIENT_ID',  # replace with your own Foursquare client ID
client_secret='YOUR_CLIENT_SECRET',  # replace with your own Foursquare client secret
v='20180323',
ll='43.6534817,-79.3839347',
limit=1
)
#testing a query
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
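To see what the parameter dictionary saves us from writing by hand, the sketch below builds roughly the same query string with the standard library's `urlencode`. The credential values are placeholders, and `base_url` and `full_url` are names chosen for this illustration.

```python
from urllib.parse import urlencode

base_url = "https://api.foursquare.com/v2/venues/explore"

params = {
    "client_id": "YOUR_CLIENT_ID",          # placeholder, not a real credential
    "client_secret": "YOUR_CLIENT_SECRET",  # placeholder, not a real credential
    "v": "20180323",
    "ll": "43.6534817,-79.3839347",
    "limit": 1,
}

# urlencode percent-escapes each value and joins the pairs with '&'
full_url = base_url + "?" + urlencode(params)
print(full_url)
```

Note that the comma in `ll` is escaped as `%2C`, a detail that is easy to get wrong when concatenating the URL manually.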

I will borrow the function **getNearbyVenues** used in the Lab to loop 'explore' queries over the neighborhoods and get the corresponding venues.

In [27]:
#since we have 103 different neighborhoods, we set the query limit to 50 so as not to exceed the API call limit
params['limit'] = 50

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
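        # update the query coordinates; note that `radius` is never added to params,
        # so the search radius is left to the API rather than set to the value passed in
        params['ll'] = str(lat)+','+str(lng)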
            
        # make the GET request
        results = requests.get(url, params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
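The nested indexing in the function assumes that every response contains at least one group and that every venue has at least one category. A more defensive version of the parsing step might look like the sketch below. `extract_venues` is a hypothetical helper, and `sample` is a hand-written payload that only mimics the keys accessed above; it is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one 'explore' response into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload with the same shape as the keys used above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe X", "location": {"lat": 43.0, "lng": -79.0},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot", "location": {"lat": 43.1, "lng": -79.1},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Parkwoods", 43.7545, -79.33)
print(rows)
```

An empty or malformed response then yields an empty list instead of raising a `KeyError` or `IndexError` in the middle of the loop.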

In [28]:
toronto_venues = getNearbyVenues(toronto_pc.Neighbourhood, toronto_pc.Latitude, toronto_pc.Longitude, radius=500)

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [29]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.7545,-79.33,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.7545,-79.33,Tim Hortons,43.760668,-79.326368,Café
2,Parkwoods,43.7545,-79.33,Donalda Golf & Country Club,43.752816,-79.342741,Golf Course
3,Parkwoods,43.7545,-79.33,Galleria Supermarket,43.75352,-79.349518,Supermarket
4,Parkwoods,43.7545,-79.33,Graydon Hall Manor,43.763923,-79.342961,Event Space


In [30]:

print(toronto_venues.groupby('Neighborhood').count()['Venue'])
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

Neighborhood
Agincourt                                          50
Alderwood, Long Branch                             50
Bathurst Manor, Wilson Heights, Downsview North    50
Bayview Village                                    50
Bedford Park, Lawrence Manor East                  50
                                                   ..
Willowdale, Willowdale West                        50
Woburn                                             50
Woodbine Heights                                   50
York Mills West                                    50
York Mills, Silver Hills                           50
Name: Venue, Length: 99, dtype: int64
There are 284 unique categories.


### 2) Analyzing the most frequent venues

In [31]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

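# add neighborhood column back to dataframe
# (caution: if a venue category is itself called "Neighborhood", this overwrites that dummy column in place)
toronto_onehot['Neighborhood'] =toronto_venues['Neighborhood'] 

# move the last column to the front; this puts 'Neighborhood' first only if the assignment above
# created a new column (the output below starts with 'Zoo Exhibit', so here it did not)
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]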

toronto_onehot.head()

Unnamed: 0,Zoo Exhibit,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
toronto_onehot.shape

(5150, 284)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [33]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo
0,Agincourt,0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.02,0.0,0.00,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.02,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.02,0.0,0.00,0.0,0.00,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.02,0.0,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.02,0.00,0.0,0.02,0.0,0.00,0.0,0.0
95,Woburn,0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.02,0.00,0.0,0.00,0.0,0.02,0.0,0.0
96,Woodbine Heights,0.0,0.0,0.0,0.0,0.00,0.0,0.04,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.0,0.0
97,York Mills West,0.0,0.0,0.0,0.0,0.00,0.0,0.00,0.0,0.0,...,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.0,0.0
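To make the one-hot encoding and the per-neighborhood mean concrete, here is a small sketch on six invented venues. It uses `DataFrame.insert`, which raises an error if a column named "Neighborhood" already exists, so a clash with a venue category of the same name would be noticed rather than silently overwritten.

```python
import pandas as pd

# Six invented venues in two invented neighborhoods
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bakery", "Park", "Park"],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)

# put the label column first; insert() refuses to overwrite an existing column
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Neighborhood A has four venues, two of them cafés, so its "Café" frequency is 0.5. This is the kind of value shown in the table above.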


#### Let's print each neighborhood along with the top 5 most common venues


In [34]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                  venue  freq
0    Chinese Restaurant  0.10
1     Indian Restaurant  0.08
2  Caribbean Restaurant  0.06
3        Clothing Store  0.04
4                Bakery  0.04


----Alderwood, Long Branch----
                venue  freq
0  Seafood Restaurant  0.06
1              Bakery  0.06
2         Coffee Shop  0.06
3         Pizza Place  0.04
4      Breakfast Spot  0.04


----Bathurst Manor, Wilson Heights, Downsview North----
                       venue  freq
0                Coffee Shop  0.08
1  Middle Eastern Restaurant  0.06
2                 Restaurant  0.06
3              Deli / Bodega  0.04
4             Clothing Store  0.04


----Bayview Village----
                    venue  freq
0       Korean Restaurant  0.08
1         Bubble Tea Shop  0.06
2             Coffee Shop  0.06
3      Chinese Restaurant  0.06
4  Furniture / Home Store  0.04


----Bedford Park, Lawrence Manor East----
              venue  freq
0       Coffee Shop  0.08
1              Café

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
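The sorting step can be tried on a single hand-made row. The sketch below uses `Series.nlargest` as an alternative to sorting and slicing; the row values are invented for illustration.

```python
import pandas as pd

# One invented row: a label followed by category frequencies
row = pd.Series({"Neighborhood": "A", "Bakery": 0.25, "Café": 0.5, "Park": 0.1})

# drop the label, coerce the rest to numbers, keep the two largest
top = row.drop("Neighborhood").astype(float).nlargest(2)
print(list(top.index))
```

Dropping the label by name, rather than by position with `iloc[1:]`, does not depend on the "Neighborhood" column being first.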

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [36]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Indian Restaurant,Caribbean Restaurant,Clothing Store,Bakery,Supermarket,Breakfast Spot,Bubble Tea Shop,Gym / Fitness Center,Restaurant
1,"Alderwood, Long Branch",Seafood Restaurant,Bakery,Coffee Shop,Pizza Place,Breakfast Spot,Park,Fast Food Restaurant,Burrito Place,Café,Italian Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Middle Eastern Restaurant,Restaurant,Deli / Bodega,Clothing Store,Turkish Restaurant,Park,Grocery Store,Greek Restaurant,Gym / Fitness Center
3,Bayview Village,Korean Restaurant,Bubble Tea Shop,Coffee Shop,Chinese Restaurant,Furniture / Home Store,Bakery,Liquor Store,Grocery Store,Ramen Restaurant,Supermarket
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Café,Sushi Restaurant,Bakery,Bagel Shop,Fast Food Restaurant,Park,Asian Restaurant,Japanese Restaurant,Grocery Store
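The column labels above are built with a three-element suffix list plus a fallback to "th", which is fine for ten columns but would produce "21th" if the count ever exceeded 20. A small hypothetical helper, `ordinal`, sketches a more general rule:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens all take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```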


### 3) Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 10 clusters.


In [37]:
from sklearn.cluster import KMeans

# set number of clusters, 1 per borough
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 9, 8, 0, 3, 5, 3, 2, 9])
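Choosing one cluster per borough is a heuristic. One common way to sanity-check a value of _k_ is to compare silhouette scores across several candidates. This technique is not used in this notebook; the sketch below shows the idea on synthetic, well-separated blobs rather than on the Toronto features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (not the Toronto data)
X, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# fit k-means for several k and record the mean silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real venue-frequency data the scores are typically much closer together, so the result is a hint rather than a definitive answer.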

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_pc

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() 

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.7545,-79.33,1,Middle Eastern Restaurant,Café,Supermarket,Burger Joint,Italian Restaurant,Mediterranean Restaurant,Caribbean Restaurant,Grocery Store,Burrito Place,Golf Course
1,M4A,North York,Victoria Village,43.7276,-79.3148,1,Middle Eastern Restaurant,Grocery Store,Gym / Fitness Center,Coffee Shop,Indian Restaurant,Hockey Arena,History Museum,Café,Shopping Mall,Italian Restaurant
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,9,Coffee Shop,Park,Theater,Italian Restaurant,Bakery,Restaurant,Performing Arts Venue,Breakfast Spot,Café,Spa
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,9,Clothing Store,Restaurant,Furniture / Home Store,Greek Restaurant,Turkish Restaurant,Cosmetics Shop,Coffee Shop,Fried Chicken Joint,Steakhouse,Sushi Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,8,Coffee Shop,Sushi Restaurant,Hotel,Café,Gym,Japanese Restaurant,Yoga Studio,Adult Boutique,Dessert Shop,Bubble Tea Shop


Finally, let's visualize the resulting clusters :


In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 4) Examine Clusters


Finally, let's look at which categories distinguish the clusters, based on the three most common venues in each:

In [40]:
toronto_merged.groupby('Cluster Labels')[['1st Most Common Venue','2nd Most Common Venue','3rd Most Common Venue']].describe()

,1st Most Common Venue,,,,2nd Most Common Venue,,,,3rd Most Common Venue,,,
Cluster Labels,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq
0,13,7,Coffee Shop,3,13,9,Bakery,4,13,11,Italian Restaurant,2
1,9,4,Middle Eastern Restaurant,6,9,6,Japanese Restaurant,4,9,6,Italian Restaurant,2
2,5,3,Coffee Shop,2,5,4,Caribbean Restaurant,2,5,2,Gym,3
3,13,5,Café,9,13,6,Coffee Shop,7,13,9,Hotel,5
4,6,2,Chinese Restaurant,5,6,6,Chinese Restaurant,1,6,3,Caribbean Restaurant,3
5,9,3,Park,7,9,8,Gym,2,9,8,Park,2
6,14,3,Italian Restaurant,12,14,6,Café,5,14,9,Bakery,3
7,3,2,Zoo Exhibit,2,3,3,Zoo Exhibit,1,3,3,Breakfast Spot,1
8,7,3,Coffee Shop,3,7,7,Cosmetics Shop,1,7,6,Coffee Shop,2
9,24,10,Coffee Shop,10,24,14,Coffee Shop,6,24,14,Restaurant,4
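Two other views can complement the summary above: listing the members of a single cluster, and cross-tabulating cluster labels against the top venue category. The sketch below uses a four-row invented frame with the same column names as `toronto_merged`.

```python
import pandas as pd

# Invented stand-in for the merged table
merged = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Cluster Labels": [0, 0, 1, 1],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park", "Gym"],
})

# neighbourhoods that belong to cluster 0
cluster0 = merged.loc[merged["Cluster Labels"] == 0, "Neighbourhood"].tolist()
print(cluster0)

# how often each category is the top venue within each cluster
table = pd.crosstab(merged["Cluster Labels"], merged["1st Most Common Venue"])
print(table)
```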
