# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this assignment, I mimiced what we did in the lab "Segmenting and Clustering Neighborhoods in New York City". Here is what I did. First, I used google geo API to retrive geoinfo for all the boroughs in Toronto. Then, I utilized Foursquare API to explore four of the boroughs in Toronto, which are Downtown Toronto, East Toronto, West Toronto, and Central Toronto. I also used k-means clustering algorithm to group the neighborhoods into clusters. Finally, I used the Folium library to visualize the neighborhoods in Toronto and their emerging clusters. 

## For this assignment, we will be required to explore and cluster the neighborhoods in Toronto.

    Table of Contents
    1. Web Scraping.
    2. Retriving Latitude and Longitude Using Google Geocoder.
    3. Explore Neighborhoods in Toronto.
    4. Analyze Each Neighborhood.
    5. Cluster Neighborhoods.
    6. Examine CLusters.

In [1]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

import json 
from pandas.io.json import json_normalize

import warnings
warnings.filterwarnings('ignore')

## 1. Web Scrapping

In [2]:
# get the html of the target page in the form of text.
source = requests\
        .get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')\
        .text

In [3]:
# pass in the html file into a BeautifulSoup and specify our parser as lxml.
soup = BeautifulSoup(source, 'lxml')

In [4]:
# find the table tag in the html
table = soup.find('table')

In [5]:
# find all the td tags under the table tag
tds = table.find_all('td')

In [6]:
postal_code_df = pd.DataFrame(columns=['Postal Code', 'Borough', 'Neighborhood'])

# Loop through a list of td tags with a step size of 3
# and append the info to the dataframe.
for i in range(0, len(tds), 3):
    postcode = tds[i].text.strip()
    borough = tds[i+1].text.strip()
    neighbourhood = tds[i+2].text.strip()
    
    if borough == 'Not assigned':
        continue

    if borough != 'Not assigned' and neighbourhood == 'Not assigned':
        neighbourhood = borough
    
    postal_code_df = postal_code_df.append({
        'Postal Code': postcode,
        'Borough': borough,
        'Neighborhood': neighbourhood
    }, ignore_index=True)

In [7]:
postal_code_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [8]:
postal_code_df.shape

(212, 3)

##  2. Geocoder to Get Latitude and Longitude

In this section, we are going to use geocoder from google to retrive latitude and longitude for all the neighborhoods in the citry, Toronto.

In [9]:
import os
import pickle
import googlemaps

In [10]:
#gmaps = googlemaps.Client(key=key)

In [11]:
# define a function to retrive the geoinfo for all
# the neighborhoods in Toronto.
def get_lat_lng(postal_code_df):
    lats = []
    lngs = []

    for i in range(postal_code_df.shape[0]):
        current_row = postal_code_df.iloc[i, :]

        geo_info = '{}, Toronto, Ontario'.\
                    format(current_row['Postal Code'])
        
        # retrive geoinfo.
        geocode_result = gmaps.geocode(geo_info)   

        lat_lng_coords = geocode_result[0]['geometry']['location']

        lats.append(lat_lng_coords['lat'])
        lngs.append(lat_lng_coords['lng'])

    postal_code_df['Latitude'] = lats
    postal_code_df['Longitude'] = lngs
    
    # serialize the dataframe so that we do not 
    # have to call Google Geo API everytime 
    # re-running the program.
    pickle_out = open('postal_code_df.pkl', 'wb')
    pickle.dump(postal_code_df, pickle_out)

    return postal_code_df

In [12]:
exists = os.path.isfile('postal_code_df.pkl')

# load if the dataframe already exists
if exists:
    pickle_in = open('postal_code_df.pkl', 'rb')
    postal_code_df = pickle.load(pickle_in)
else:
    postal_code_df = get_lat_lng(postal_code_df)

In [13]:
postal_code_df.head(11)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636
4,M6A,North York,Lawrence Heights,43.718518,-79.464763
5,M6A,North York,Lawrence Manor,43.718518,-79.464763
6,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
7,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
8,M1B,Scarborough,Rouge,43.806686,-79.194353
9,M1B,Scarborough,Malvern,43.806686,-79.194353


In [14]:
postal_code_df.Borough.unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

In [15]:
postal_code_df.loc[191, :]

Postal Code                  M4X
Borough         Downtown Toronto
Neighborhood      St. James Town
Latitude                  43.668
Longitude               -79.3677
Name: 191, dtype: object

In [16]:
# The dataframe contains two columns of the same info shown as above.
# Thus, we delete the duplicate.
postal_code_df.drop(index=191, axis=0, inplace=True)

## 3. Explore Toronto

We are only going to explore the boroughs that contains 'Toronto' in their names, namely Downtown Toronto, East Toronto, West Toronto, and Central Toronto.

In [17]:
from geopy.geocoders import Nominatim
import folium

In [18]:
# load the indices of all columns with target boroughs.
toronto_idx = [i for i in postal_code_df.index.tolist() 
                if 'Toronto' in postal_code_df.loc[i, 'Borough']]

In [19]:
# retrive the columns that contains 'Toronto' in the values of Borough column.
toronto_df = postal_code_df.iloc[toronto_idx, 1:].reset_index()

In [20]:
# delete the extra column names 'index'.
toronto_df.drop(columns='index', axis=1, inplace=True)

In [21]:
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,Downtown Toronto,Regent Park,43.65426,-79.360636
2,Downtown Toronto,Ryerson,43.657162,-79.378937
3,Downtown Toronto,Garden District,43.657162,-79.378937
4,Downtown Toronto,St. James Town,43.651494,-79.375418


In [22]:
print('The dataframe has {} boroughs and {} neighborhoods.'
     .format(
         len(toronto_df['Borough'].unique()),
         len(toronto_df['Neighborhood'].unique())
     ))

The dataframe has 5 boroughs and 73 neighborhoods.


In [23]:
address = 'Toronto, ON'

geolocator = Nominatim()
toronto_location = geolocator.geocode(address)
toronto_lat = toronto_location.latitude
toronto_lng = toronto_location.longitude

print('The geographical coordinate of Toronto are {}, {}'
      .format(toronto_lat, toronto_lng))

The geographical coordinate of Toronto are 43.653963, -79.387207


In [24]:
# create map of New York using latitude and longitude values.
map_toronto = folium.Map(location = [toronto_lat, toronto_lng],
                        zoom_start = 13)

# add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], 
                           toronto_df['Longitude'],
                          toronto_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_toronto)
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

## Let's explore the first neighborhood in our dataframe.

In [26]:
toronto_df.loc[0, 'Neighborhood']

'Harbourfront'

In [27]:
neighborhood_latitude = toronto_df.loc[0, 'Latitude']
neighborhood_longitude = toronto_df.loc[0, 'Longitude']

neighborhood_name = toronto_df.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'
      .format(neighborhood_name, neighborhood_latitude, neighborhood_longitude))

Latitude and longitude values of Harbourfront are 43.6542599, -79.36063589999999.


#### Now, let's get the top 100 venues that are in Harbourfront within a radius of 500 meters

First, let's create the GET request URL. Name your URL **url**.

In [28]:
# type your answer here
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT
)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=0IEBVKH4I4CAMHHIPOTPFHPJ4WCZAHLQ2RH30J2KANUXKQBW&client_secret=NJMTTV2P23SM45TRBQKWVIYR5MVQZR0EZJMGJBTYBRDF0UR3&v=20180920&ll=43.6542599,-79.36063589999999&radius=500&limit=100'

Send the GET request and exmaine the results.

In [29]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ba9a6a1db04f50f043212cc'},
 'response': {'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 51,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.35442790014858},
   'sw': {'lat': 43.6497598955, 'lng': -79.3668438998514}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.653446723052674,
          'lng': -79.3620167174383}],
        'distance': 143,
       

In [30]:
# function that extracts the category of the venue.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a pandas dataframe

In [31]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row.
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns.
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Body Blitz Spa East,Spa,43.654735,-79.359874
3,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
4,Cooper Koo YMCA,Gym / Fitness Center,43.653191,-79.357947


In [32]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

51 venues were returned by Foursquare.


# 3. Explore Neighborhoods in Toronto

In [33]:
# This function repeats to retrive information of
# all the neighborhoods in Toronto.
def get_nearby_venues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']
        ) for v in results])
        
        nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
        nearby_venues.columns = ['Neighborhood',
                              'Neighborhood Latitude', 
                              'Neighborhood Longitude', 
                              'Venue', 
                              'Venue Latitude', 
                              'Venue Longitude', 
                              'Venue Category']
    return nearby_venues

In [34]:
toronto_venues = get_nearby_venues(names=toronto_df['Neighborhood'],
                                  latitudes = toronto_df['Latitude'],
                                  longitudes = toronto_df['Longitude'])

Harbourfront
Regent Park
Ryerson
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide
King
Richmond
Dovercourt Village
Dufferin
Harbourfront East
Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West
Riverdale
Design Exchange
Toronto Dominion Centre
Brockton
Exhibition Place
Parkdale Village
The Beaches West
India Bazaar
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North
Forest Hill West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
Harbord
University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East
Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
Railway Lands
South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown
Underground city
The Kingsway
Business reply mail Pro

In [35]:
print('The dataframe contains {} rows and {} features/columns.'
      .format(toronto_venues.shape[0], toronto_venues.shape[1]))
toronto_venues.head()

The dataframe contains 3063 rows and 7 features/columns.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
3,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,Harbourfront,43.65426,-79.360636,Cooper Koo YMCA,43.653191,-79.357947,Gym / Fitness Center


In [36]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,100,100,100,100,100,100
Bathurst Quay,13,13,13,13,13,13
Berczy Park,56,56,56,56,56,56
Brockton,21,21,21,21,21,21
Business reply mail Processing Centre969 Eastern,17,17,17,17,17,17
CN Tower,13,13,13,13,13,13
Cabbagetown,46,46,46,46,46,46
Central Bay Street,85,85,85,85,85,85
Chinatown,100,100,100,100,100,100
Christie,16,16,16,16,16,16


In [37]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 229 uniques categories.


## 4. Analyze Each Neighborhood

In [38]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
print('This dataframe contains {} rows and {} columns.'.
      format(toronto_onehot.shape[0], toronto_onehot.shape[1]))

This dataframe contains 3063 rows and 229 columns.


#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [40]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Women's Store
0,Adelaide,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01
1,Bathurst Quay,0.0,0.0,0.076923,0.076923,0.153846,0.153846,0.153846,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Business reply mail Processing Centre969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
print('There are {} neighborhoods and {} venues'
      .format(toronto_grouped.shape[0], toronto_grouped.shape[1]))

There are 73 neighborhoods and 229 venues


#### Let's print each neighborhood along with the top 5 most common venues.

In [42]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print("\n")

----Adelaide----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.06
2           Steakhouse  0.04
3  American Restaurant  0.04
4      Thai Restaurant  0.04


----Bathurst Quay----
              venue  freq
0  Airport Terminal  0.15
1    Airport Lounge  0.15
2   Airport Service  0.15
3   Harbor / Marina  0.08
4  Sculpture Garden  0.08


----Berczy Park----
            venue  freq
0     Coffee Shop  0.07
1    Cocktail Bar  0.05
2      Restaurant  0.04
3  Farmers Market  0.04
4     Cheese Shop  0.04


----Brockton----
                    venue  freq
0             Coffee Shop  0.14
1          Breakfast Spot  0.10
2                    Café  0.10
3  Furniture / Home Store  0.05
4    Caribbean Restaurant  0.05


----Business reply mail Processing Centre969 Eastern----
              venue  freq
0        Comic Shop  0.06
1     Auto Workshop  0.06
2              Park  0.06
3  Recording Studio  0.06
4        Restaurant  0.06


----CN Tower----
              venu

                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.06
2           Steakhouse  0.04
3  American Restaurant  0.04
4      Thai Restaurant  0.04


----Riverdale----
                venue  freq
0    Greek Restaurant  0.26
1      Ice Cream Shop  0.07
2         Coffee Shop  0.07
3           Bookstore  0.05
4  Italian Restaurant  0.05


----Roncesvalles----
           venue  freq
0    Coffee Shop  0.15
1      Gift Shop  0.15
2        Dog Run  0.08
3            Bar  0.08
4  Movie Theater  0.08


----Rosedale----
         venue  freq
0         Park  0.50
1   Playground  0.25
2        Trail  0.25
3  Yoga Studio  0.00
4       Museum  0.00


----Roselawn----
                      venue  freq
0                    Garden   1.0
1               Music Venue   0.0
2         Martial Arts Dojo   0.0
3            Medical Center   0.0
4  Mediterranean Restaurant   0.0


----Runnymede----
                venue  freq
0         Coffee Shop  0.09
1                Café  0.07
2    

#### Let's put that into a pandas dataframe.

First, let's write a function to sort the venues in descending order.

In [43]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [44]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for idx in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(idx+1, indicators[idx]))
    except:
        columns.append('{}th Most Common Venue'.format(idx+1))
        
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for idx in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[idx, 1:] = return_most_common_venues(toronto_grouped.iloc[idx, :],
                                                                         num_top_venues)
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Café,Coffee Shop,American Restaurant,Steakhouse,Thai Restaurant,Restaurant,Hotel,Burger Joint,Gym,Bar
1,Bathurst Quay,Airport Lounge,Airport Service,Airport Terminal,Boutique,Harbor / Marina,Airport,Airport Food Court,Boat or Ferry,Sculpture Garden,Plane
2,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Beer Bar,Bakery,Steakhouse,Restaurant,Cheese Shop,Café,Farmers Market
3,Brockton,Coffee Shop,Café,Breakfast Spot,Gym / Fitness Center,Stadium,Gym,Furniture / Home Store,Italian Restaurant,Falafel Restaurant,Convenience Store
4,Business reply mail Processing Centre969 Eastern,Smoke Shop,Light Rail Station,Recording Studio,Butcher,Auto Workshop,Fast Food Restaurant,Spa,Farmers Market,Restaurant,Brewery


# 4. Cluster Neighborhoods

Run k-means to cluster the neighborhood into 5 clusters.

In [45]:
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

In [46]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe.
kmeans.labels_[0:10]

array([0, 3, 0, 0, 0, 3, 0, 0, 0, 0], dtype=int32)

In [47]:
toronto_merged = toronto_df

# add clustering labels
toronto_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Coffee Shop,Bakery,Park,Café,Pub,Mexican Restaurant,Breakfast Spot,Theater,Gym / Fitness Center,French Restaurant
1,Downtown Toronto,Regent Park,43.65426,-79.360636,3,Coffee Shop,Bakery,Park,Café,Pub,Mexican Restaurant,Breakfast Spot,Theater,Gym / Fitness Center,French Restaurant
2,Downtown Toronto,Ryerson,43.657162,-79.378937,0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bar,Japanese Restaurant,Tea Room,Middle Eastern Restaurant,Ramen Restaurant,Italian Restaurant
3,Downtown Toronto,Garden District,43.657162,-79.378937,0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bar,Japanese Restaurant,Tea Room,Middle Eastern Restaurant,Ramen Restaurant,Italian Restaurant
4,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Restaurant,Café,Hotel,Clothing Store,Japanese Restaurant,Gastropub,Bakery,Cosmetics Shop,Italian Restaurant


In [48]:
# create map
map_clusters = folium.Map(location=[toronto_lat, toronto_lng], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(toronto_merged['Latitude'],
                                  toronto_merged['Longitude'],
                                 toronto_merged['Neighborhood'],
                                 toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7).add_to(map_clusters)

map_clusters

# 5. Examine CLusters

NOw, let's examine each cluster and determine the discriminating venue categories that distinguish each cluster.

#### Cluster 1

In [49]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Harbourfront,Coffee Shop,Bakery,Park,Café,Pub,Mexican Restaurant,Breakfast Spot,Theater,Gym / Fitness Center,French Restaurant
2,Ryerson,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bar,Japanese Restaurant,Tea Room,Middle Eastern Restaurant,Ramen Restaurant,Italian Restaurant
3,Garden District,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bar,Japanese Restaurant,Tea Room,Middle Eastern Restaurant,Ramen Restaurant,Italian Restaurant
4,St. James Town,Coffee Shop,Restaurant,Café,Hotel,Clothing Store,Japanese Restaurant,Gastropub,Bakery,Cosmetics Shop,Italian Restaurant
6,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Beer Bar,Bakery,Steakhouse,Restaurant,Cheese Shop,Café,Farmers Market
7,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Ice Cream Shop,Bar,Burger Joint,Japanese Restaurant,Falafel Restaurant,Thai Restaurant,Indian Restaurant
8,Christie,Grocery Store,Café,Park,Nightclub,Diner,Italian Restaurant,Baby Store,Restaurant,Athletics & Sports,Convenience Store
9,Adelaide,Café,Coffee Shop,American Restaurant,Steakhouse,Thai Restaurant,Restaurant,Hotel,Burger Joint,Gym,Bar
10,King,Café,Coffee Shop,American Restaurant,Steakhouse,Thai Restaurant,Restaurant,Hotel,Burger Joint,Gym,Bar
11,Richmond,Café,Coffee Shop,American Restaurant,Steakhouse,Thai Restaurant,Restaurant,Hotel,Burger Joint,Gym,Bar


#### Cluster 2

In [50]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,High Park,Mexican Restaurant,Café,Bar,Grocery Store,Arts & Crafts Store,Music Venue,Diner,Cajun / Creole Restaurant,Sandwich Place,Bookstore
47,Runnymede,Coffee Shop,Café,Sushi Restaurant,Pizza Place,Italian Restaurant,Diner,Restaurant,Bar,Pub,Gym
56,Rathnelly,Coffee Shop,Pub,Sports Bar,Fried Chicken Joint,Vietnamese Restaurant,Supermarket,Sushi Restaurant,Convenience Store,Pizza Place,American Restaurant


#### Cluster 3

In [51]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,Swansea,Coffee Shop,Café,Sushi Restaurant,Pizza Place,Italian Restaurant,Diner,Restaurant,Bar,Pub,Gym


#### Cluster 4

In [52]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Regent Park,Coffee Shop,Bakery,Park,Café,Pub,Mexican Restaurant,Breakfast Spot,Theater,Gym / Fitness Center,French Restaurant
5,The Beaches,Coffee Shop,Trail,Pub,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
26,The Beaches West,Sandwich Place,Park,Steakhouse,Sushi Restaurant,Ice Cream Shop,Italian Restaurant,Pub,Burrito Place,Burger Joint,Fish & Chips Shop
30,Studio District,Café,Coffee Shop,American Restaurant,Bakery,Italian Restaurant,Gastropub,Gym / Fitness Center,Coworking Space,Park,New American Restaurant
33,Davisville North,Hotel,Park,Breakfast Spot,Food & Drink Shop,Burger Joint,Sandwich Place,Women's Store,Dim Sum Restaurant,Electronics Store,Eastern European Restaurant
41,Yorkville,Coffee Shop,Café,Sandwich Place,Pizza Place,French Restaurant,Convenience Store,Cosmetics Shop,Pub,Burger Joint,Liquor Store
52,Grange Park,Café,Vegetarian / Vegan Restaurant,Chinese Restaurant,Bar,Vietnamese Restaurant,Bakery,Mexican Restaurant,Coffee Shop,Dumpling Restaurant,Ice Cream Shop


#### Cluster 5

In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,Commerce Court,Coffee Shop,Café,Hotel,Restaurant,American Restaurant,Gym,Seafood Restaurant,Deli / Bodega,Italian Restaurant,Steakhouse
