# Segmenting and Clustering of Neighborgood in Toronto 


## Summary of Project

- 1. **Objective**
- 2. **Data wrangling**
- 2.1 Loading data
- 2.2 Cleaning data
- 3. **Geolocation and Folium**
- 3.1 Merging Latitude and Longtidude Data
- 3.2 Mapping City of Toronto area
- 3.3 Mapping Downtown Toronto area
- 3.4 JSON data - shops and business
- 3.5 JSON data to Pandas DF
- 3.6 Finding venues and neighborhoods
- 3.7 Analyzing neighborhoods
- 4. **Model Evaluation**
- 4.1 K-Means Clustering
- 4.2 K-Means Clustering Visualisation
- 4.3 Examining Clusters


## 1. Objective

Objective of this project is to analyze City of Toronto and Downtown Toronto area shops and business locations. K-Means clustering has been used to determine same business classes in the city area. It is significant to help new starters to choose right location for their business. This projects aid to solve this problem. 

## 2. Data Wrangling

In [1]:
# Importing Essential libraries:

import pandas as pd
import matplotlib as pl
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import Series, DataFrame
from matplotlib import rcParams
from matplotlib import pyplot

import json

import requests 
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim 
import folium

### 2.1 Loading data

In [2]:
#Downloading data from Wikipedia
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0] # Note: the data can be changed.
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned


In [3]:
# Let's define the shape
df.shape

(20, 9)

### 2.2 Cleaning data

As we see from above, we need just one column to show the data. 

In [4]:
# Let's stack 9 columns to 1 column in order to make it easy to work.
df= pd.DataFrame(df.stack().reset_index(drop = True))
df.head()

Unnamed: 0,0
0,M1ANot assigned
1,M2ANot assigned
2,M3ANorth York(Parkwoods)
3,M4ANorth York(Victoria Village)
4,M5ADowntown Toronto(Regent Park / Harbourfront)


Here we need to drop not assigned values, create three different columns postal, region and neighborhood. We need use lamda and map() function to do that.

In [5]:
# Data cleaning by using Lambda and map() function.

df.columns = ['all'] # Changing column name to 'all' from 0

df['Postal Code'] = df['all'].map(lambda x:x[0:3]) # separating postal code first tree letters.

df['reg_neig'] = df['all'].map(lambda x:x[3:]) # separating after postal code and making columns

df.drop(df[df['reg_neig'] == 'Not assigned'].index , inplace=True) # dropping Not assigned values

df['region'] = df['reg_neig'].map(lambda x:x.split('(', 1)[0]) # Splitting region from neighborhood

df['Neighborhood'] = df['reg_neig'].map(lambda x:x.split('(', 1)[-1]).str[:-1] # deleting parathesis

df.drop(columns = ['all', 'reg_neig'], inplace = True) # dropping columns
df.head()

Unnamed: 0,Postal Code,region,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Queen's Park,Ontario Provincial Government


Let's combine them

In [6]:

print('The dataframe has {} regions and {} neighborhoods.' .format(
        len(df['region'].unique()),
        df.shape[0]))


The dataframe has 15 regions and 103 neighborhoods.


Let's create dataframe possessing longtitudes and latitudes

## 3. Geolocation and Folium

### 3.1 Merging Latitude and Longitude data

In [7]:
df_lat_long = pd.read_csv('https://cocl.us/Geospatial_data')
df_new = df.merge(df_lat_long, on = 'Postal Code')

df_new.head()

Unnamed: 0,Postal Code,region,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


### 3.2 Mapping City of Toronto Area

Each **geolocation** service you might use, such as Google Maps, Bing Maps, or Nominatim, has its own class in geopy.geocoders abstracting the service’s API. Geocoders each define at least a geocode method, for resolving a location from a string, and may define a reverse method, which resolves a pair of coordinates to an address. 

Let's explore and cluster neighborhood in Toronto

In [8]:
loc = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(loc)
latitude = location.latitude
longitude = location.longitude

print('Toronto Latitude is {}, Longitude is {}.' .format(latitude, longitude))

Toronto Latitude is 43.6534817, Longitude is -79.3839347.


In [9]:
map_of_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, region, neighborhood in zip(df_new['Latitude'], df_new['Longitude'], df_new['region'], df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_of_toronto)  
    
map_of_toronto

In [10]:
# Counting region values
df_new['region'].value_counts()

North York                                                      24
Downtown Toronto                                                17
Scarborough                                                     17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East York                                                        4
East Toronto                                                     4
EtobicokeNorthwest                                               1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
East YorkEast Toronto                                            1
Queen's Park                                                     1
MississaugaCanada Post Gateway Processing Centre                 1
East TorontoBusiness reply mail Processing Centre969 Eastern  

### 3.3  Mapping Downtown Toronto area

In [11]:
downtown_toronto = df_new[df_new['region'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_toronto.head()

Unnamed: 0,Postal Code,region,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
4,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [12]:
loc1 = 'Downtown Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(loc1)
latitude = location.latitude
longitude = location.longitude
print('Toronto Latitude is {}, Longitude is {}.' .format(latitude, longitude))

Toronto Latitude is 43.6541737, Longitude is -79.38081164513409.


In [13]:
map_downtown_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, region, neighborhood in zip(downtown_toronto['Latitude'], downtown_toronto['Longitude'], downtown_toronto['region'], downtown_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown_toronto)  
    
map_downtown_toronto

In [14]:
# Let's explore the neighborhood and segment them
downtown_toronto.loc[0, 'Neighborhood']

'Regent Park / Harbourfront'

In [15]:
neighbor_latitude = downtown_toronto.loc[0, 'Latitude']
neighbor_longitude = downtown_toronto.loc[0, 'Longitude']

neighbor_name = downtown_toronto.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {} and {}' .format(neighbor_name, 
                                                               neighbor_latitude, 
                                                               neighbor_longitude)
     )

Latitude and longitude values of Regent Park / Harbourfront are 43.6542599 and -79.3606359


### 3.4 JSON Data - Shops

In [16]:
CLIENT_ID = 'TOITK2RBLZUXQDMPORAIE5GARRN2E1RRHN0RS2EGNJ1QSL3L'
CLIENT_SECRET = 'RMUMYUEAQTEYRGZBFWSQVVAWD5Z4DM2H4XJPH2FHFQONI3EV'
VERSION = '20180605'
LIMIT = 100
radius = 500

url_venues = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbor_latitude, 
    neighbor_longitude, 
    radius, 
    LIMIT)

url_venues

'https://api.foursquare.com/v2/venues/explore?&client_id=TOITK2RBLZUXQDMPORAIE5GARRN2E1RRHN0RS2EGNJ1QSL3L&client_secret=RMUMYUEAQTEYRGZBFWSQVVAWD5Z4DM2H4XJPH2FHFQONI3EV&v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=100'

**JSON** (JavaScript Object Notation) is an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).

In [17]:
results = requests.get(url_venues).json() # loading JSON data
results

{'meta': {'code': 200, 'requestId': '6064f3932138947f20a57125'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 45,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '53b8466a498e83df908c3f21',
       'name': 'Tandem Coffee',
       'location': {'address': '368 King St E',
        'crossStreet': 'at Trinity St',
        'lat': 43.65355870959944,
        'lng': -79.36180945913513,
        'labeledLatLngs': [{'label': 'display',
 

In [18]:
# Let's define the function
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### 3.5 JSON data to Pandas DF

In [19]:
# Let's structure json data to Pandas Dataframe
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  after removing the cwd from sys.path.


Unnamed: 0,name,categories,lat,lng
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Impact Kitchen,Restaurant,43.656369,-79.35698
4,Body Blitz Spa East,Spa,43.654735,-79.359874


### 3.6 Finding venues and neighberhoods

In [20]:
print('{} venues were returned by Foursquare near {}' .format(nearby_venues.shape[0], neighbor_name))

45 venues were returned by Foursquare near Regent Park / Harbourfront


In [21]:
# Let's repeat the all process in DT Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url_foursquare = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url_foursquare).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [22]:
df_venues = getNearbyVenues(names=downtown_toronto['Neighborhood'],
                            latitudes=downtown_toronto['Latitude'],
                            longitudes=downtown_toronto['Longitude']
                            )

Regent Park / Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond / Adelaide / King
Harbourfront East / Union Station / Toronto Islands
Toronto Dominion Centre / Design Exchange
Commerce Court / Victoria Hotel
University of Toronto / Harbord
Kensington Market / Chinatown / Grange Park
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst Quay / South Niagara / Island airport
Rosedale
St. James Town / Cabbagetown
First Canadian Place / Underground city
Church and Wellesley


In [23]:
df_venues.shape

(1085, 7)

In [24]:
df_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,Regent Park / Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Regent Park / Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,Regent Park / Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


#### Let's count the neighberhoods which were returned

In [25]:
df_venues.groupby('Neighborhood')['Venue'].count().to_frame()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Berczy Park,59
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst Quay / South Niagara / Island airport,16
Central Bay Street,64
Christie,16
Church and Wellesley,70
Commerce Court / Victoria Hotel,100
First Canadian Place / Underground city,100
"Garden District, Ryerson",100
Harbourfront East / Union Station / Toronto Islands,100
Kensington Market / Chinatown / Grange Park,63


In [26]:
print('There are {} unique categories.' .format(len(df_venues['Venue Category'].unique())))

There are 204 unique categories.


### 3.7 Analysing Neighborhood 

In [27]:
# Analysing neighborhood
# one hot encoding
toronto_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = df_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
toronto_onehot.shape

(1085, 204)

In [29]:
# groupby() by Neighborhood
df_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
df_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.016949,0.0,0.0,0.0,0.0,0.016949,0.0,0.0,0.0,0.0
1,CN Tower / King and Spadina / Railway Lands / ...,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.03125,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,...,0.014286,0.014286,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
num_top_venues = 5

for hood in df_grouped['Neighborhood']:

    temp = df_grouped[df_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

          venue  freq
0   Coffee Shop  0.07
1  Cocktail Bar  0.05
2        Bakery  0.05
3      Beer Bar  0.03
4    Restaurant  0.03


              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12
3     Boat or Ferry  0.06
4   Harbor / Marina  0.06


                 venue  freq
0          Coffee Shop  0.17
1   Italian Restaurant  0.05
2       Sandwich Place  0.05
3                 Café  0.05
4  Japanese Restaurant  0.03


           venue  freq
0  Grocery Store  0.25
1           Café  0.19
2           Park  0.12
3    Candy Store  0.06
4     Baby Store  0.06


                 venue  freq
0          Coffee Shop  0.09
1     Sushi Restaurant  0.07
2  Japanese Restaurant  0.07
3           Restaurant  0.04
4              Gay Bar  0.04


                venue  freq
0         Coffee Shop  0.13
1          Restaurant  0.07
2                Café  0.06
3               Hotel  0.06
4  Italian Restaurant  0.04


                 venue  freq
0          Coffee

Create a new dataframe and display the top 10 venues for each neighborhood

In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [32]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']

for ind in np.arange(df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Cheese Shop,Pharmacy,Beer Bar,Restaurant,Farmers Market,Seafood Restaurant,Liquor Store
1,CN Tower / King and Spadina / Railway Lands / ...,Airport Lounge,Airport Service,Airport Terminal,Coffee Shop,Harbor / Marina,Sculpture Garden,Boat or Ferry,Rental Car Location,Bar,Boutique
2,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Salad Place,Japanese Restaurant,Bubble Tea Shop,Burger Joint,Thai Restaurant,Department Store
3,Christie,Grocery Store,Café,Park,Baby Store,Nightclub,Coffee Shop,Restaurant,Italian Restaurant,Athletics & Sports,Candy Store
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Yoga Studio,Men's Store,Pub,Hotel,Fast Food Restaurant


## 4. Model Evaluation

### 4.1 K-Means Clustering

In [33]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = df_grouped.drop('Neighborhood', 1)

# run k-means clustering
KM = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
KM.labels_[0:10]

array([0, 4, 0, 2, 0, 0, 0, 0, 0, 0], dtype=int32)

In [34]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', KM.labels_)

toronto_merged = downtown_toronto

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,region,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,0,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Theater,Café,Wine Shop,Event Space,Performing Arts Venue
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Bubble Tea Shop,Cosmetics Shop,Middle Eastern Restaurant,Café,Pizza Place,Movie Theater,Hotel,Bookstore
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Cosmetics Shop,Cocktail Bar,Hotel,Gym,Lingerie Store,Department Store,Moroccan Restaurant,Park
3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Bakery,Cocktail Bar,Cheese Shop,Pharmacy,Beer Bar,Restaurant,Farmers Market,Seafood Restaurant,Liquor Store
4,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Salad Place,Japanese Restaurant,Bubble Tea Shop,Burger Joint,Thai Restaurant,Department Store


### 4.2 K-Means Clustering Visualisation

In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 4.3 Examining clusters

#### Cluster 1

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,region,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Theater,Café,Wine Shop,Event Space,Performing Arts Venue
1,Downtown Toronto,0,Clothing Store,Coffee Shop,Bubble Tea Shop,Cosmetics Shop,Middle Eastern Restaurant,Café,Pizza Place,Movie Theater,Hotel,Bookstore
2,Downtown Toronto,0,Café,Coffee Shop,Cosmetics Shop,Cocktail Bar,Hotel,Gym,Lingerie Store,Department Store,Moroccan Restaurant,Park
3,Downtown Toronto,0,Coffee Shop,Bakery,Cocktail Bar,Cheese Shop,Pharmacy,Beer Bar,Restaurant,Farmers Market,Seafood Restaurant,Liquor Store
4,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Salad Place,Japanese Restaurant,Bubble Tea Shop,Burger Joint,Thai Restaurant,Department Store
6,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Clothing Store,Gym,Thai Restaurant,Hotel,Deli / Bodega,Sushi Restaurant,Concert Hall
7,Downtown Toronto,0,Coffee Shop,Aquarium,Café,Hotel,Sporting Goods Shop,Fried Chicken Joint,Scenic Lookout,Brewery,Italian Restaurant,Restaurant
8,Downtown Toronto,0,Coffee Shop,Hotel,Café,Restaurant,Japanese Restaurant,Salad Place,Seafood Restaurant,Italian Restaurant,Sushi Restaurant,Breakfast Spot
9,Downtown Toronto,0,Coffee Shop,Restaurant,Hotel,Café,Gym,Italian Restaurant,American Restaurant,Cocktail Bar,Deli / Bodega,Japanese Restaurant
11,Downtown Toronto,0,Café,Coffee Shop,Vegetarian / Vegan Restaurant,Mexican Restaurant,Vietnamese Restaurant,Gaming Cafe,Bar,Grocery Store,Park,Farmers Market


#### Cluster 2

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,region,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Downtown Toronto,1,Park,Trail,Playground,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


#### Cluster 3

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,region,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Downtown Toronto,2,Grocery Store,Café,Park,Baby Store,Nightclub,Coffee Shop,Restaurant,Italian Restaurant,Athletics & Sports,Candy Store


#### Cluster 4

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,region,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Downtown Toronto,3,Café,Bookstore,Bar,Italian Restaurant,Japanese Restaurant,Bakery,Yoga Studio,Beer Bar,Comfort Food Restaurant,Sandwich Place


#### Cluster 5

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,region,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Downtown Toronto,4,Airport Lounge,Airport Service,Airport Terminal,Coffee Shop,Harbor / Marina,Sculpture Garden,Boat or Ferry,Rental Car Location,Bar,Boutique
