## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada: Part 3

Created by Rhys Morgan on 29th July 2020

[My GitHub Repository](https://github.com/rmjmorgan/Coursera_Capstone 'Coursera Capstone Project')  
[Course Info](https://www.coursera.org/learn/applied-data-science-capstone 'Applied Data Science Capstone')

This notebook demonstrates my understanding of the Foursquare API, K-Means clustering, and displaying the information on a map using Folium.

---

In [1]:
# import required libraries
import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

print('All libraries imported.')

All libraries imported.


In [2]:
# import dataframe from part 2.
Canada_df = pd.read_csv('Canadian_Geospacial_Data.csv')
Canada_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


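Since I wish to explore venues in Toronto only, I must remove all rows that do not relate to Toronto. Every Toronto borough in this dataset has 'Toronto' in its name, so a substring filter on the Borough column is enough.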

In [3]:
# filter dataframe to only contain boroughs in Toronto.
Toronto_df = Canada_df[Canada_df.Borough.str.contains('Toronto')].reset_index(drop=True)
Toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [4]:
# obtain geographical coordinates of Toronto.
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent='Toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(
      latitude, longitude
     )
) 

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Using Folium, I can display my data on a map, so long as I know the geographical coordinates of each data entry.

In [5]:
# create folium map of Toronto.
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# display location data.
for lat, lng, borough, neighborhood in zip(Toronto_df.Latitude, Toronto_df.Longitude, Toronto_df.Borough, Toronto_df.Neighborhood):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)

map_Toronto

I can use the Foursquare API to retrieve information on recommended venues around every neighborhood in my dataframe.  
Note: My Foursquare credentials have been removed for obvious reasons.

In [27]:
# Foursquare API credentials.
CLIENT_ID = 'REDACTED'
CLIENT_SECRET = 'REDACTED'
VERSION = '20180605'
LIMIT = 100

To avoid requesting venue information for every neighborhood by hand, I can write a function that loops through my dataframe and sends one request per neighborhood.

In [7]:
# create function to obtain recommended local venues at each neighborhood.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL.
        URL = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(URL).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue.
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
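
The request line in `getNearbyVenues` assumes that every response contains `response -> groups[0] -> items` and that every venue has at least one category. As a minimal sketch of a more defensive parsing step only, the hypothetical helper `parse_explore_items` below (not part of the original notebook) takes an already-decoded JSON payload with that same nesting and skips anything that is missing. The sample payload is made up for illustration and is not real Foursquare output.

```python
def parse_explore_items(payload, name, lat, lng):
    """Turn one decoded 'explore' payload into rows of venue data.

    Hypothetical helper: it assumes the same nesting that getNearbyVenues
    indexes (response -> groups -> items -> venue) and returns an empty
    list instead of raising when part of that nesting is absent.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # venues without a name or a category cannot be used for clustering.
        if 'name' not in venue or not categories:
            continue
        rows.append((
            name,
            lat,
            lng,
            venue['name'],
            location.get('lat'),
            location.get('lng'),
            categories[0]['name'],
        ))
    return rows


# made-up payload for illustration only.
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Cafe',
                           'location': {'lat': 43.65, 'lng': -79.36},
                           'categories': [{'name': 'Coffee Shop'}]}},
                {'venue': {'name': 'Uncategorised Place',
                           'location': {'lat': 43.66, 'lng': -79.37},
                           'categories': []}},
            ]
        }]
    }
}

rows = parse_explore_items(sample_payload, 'Example Neighborhood', 43.654, -79.360)
empty_rows = parse_explore_items({}, 'Example Neighborhood', 43.654, -79.360)
print(rows)
print(empty_rows)
```

Inside the loop, such a helper could replace the list comprehension, so that one malformed response does not abort the whole run.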

In [8]:
# use the function.
Toronto_venues = getNearbyVenues(names = Toronto_df.Neighborhood,
                        latitudes = Toronto_df.Latitude,
                        longitudes = Toronto_df.Longitude)

In [9]:
# display the venues that were discovered, along with their location data.
print(Toronto_venues.shape)
Toronto_venues.head()

(1631, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


K-Means cannot work directly with categorical data in text format, so I must convert the venue categories into a numerical format.

In [10]:
# one-hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

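# 'Neighborhood' is also a venue category, so this overwrites that existing column with the neighborhood names.
Toronto_onehot['Neighborhood'] = Toronto_venues['Neighborhood']

# keep 'Neighborhood' and every column after it.
# caution: this label slice also discards all category columns that sort before 'Neighborhood'.
Toronto_onehot = Toronto_onehot.loc[:, 'Neighborhood':]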

Toronto_onehot.head()

Unnamed: 0,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
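
The header above starts at 'New American Restaurant', which suggests that slicing with `.loc[:, 'Neighborhood':]` kept only the categories that sort after 'Neighborhood'. The following is a minimal sketch of one way to keep every category, assuming a small made-up dataframe in place of `Toronto_venues`; the renamed column label `'Neighborhood (Category)'` is my own choice, not something from the notebook.

```python
import pandas as pd

# made-up venues; note that 'Neighborhood' also appears as a venue category.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Bakery', 'Neighborhood', 'Yoga Studio'],
})

# one column per category, holding 0/1 integers.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# rename the clashing category so it cannot be confused with the identifier column.
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (Category)'})

# insert the identifier as the first column without dropping anything.
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

print(onehot.columns.tolist())
```

Using `insert` at position 0 moves the identifier to the front explicitly, so the result does not depend on where 'Neighborhood' happens to fall alphabetically.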


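By giving each venue type its own column, the K-Means algorithm will be able to cluster the neighborhoods. However, it is good practice to normalize the values because K-Means measures similarity with Euclidean distance. Taking the mean of each column per neighborhood converts the raw counts into frequencies between 0 and 1, so neighborhoods with many venues do not dominate those with few.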

In [11]:
# display the average frequency of venue types in each neighborhood.
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped.head()

Unnamed: 0,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.016129,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016129,0.0,0.0,0.016129,0.0,0.0,0.016129
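
To illustrate why the grouped mean acts as a normalization, here is a small sketch on made-up one-hot rows (the neighborhoods 'A' and 'B' and their venues are invented). Each venue contributes exactly one 1 to its row, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category, and the shares add up to 1 when every category column is present.

```python
import pandas as pd

# made-up one-hot rows: four venues in 'A', two venues in 'B'.
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Cafe': [1, 1, 0, 0, 0, 0],
    'Park': [0, 0, 1, 0, 1, 0],
    'Pub':  [0, 0, 0, 1, 0, 1],
})

# mean of 0/1 values per neighborhood = frequency of each category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()

# every row of frequencies sums to 1, regardless of how many venues it has.
row_totals = grouped.drop(columns='Neighborhood').sum(axis=1)

print(grouped)
print(row_totals.tolist())
```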


In [12]:
# function to sort venues by highest frequency.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [13]:
# specify rank limit.
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues.
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third fall back to the 'th' suffix.
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe.
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Seafood Restaurant,Restaurant,Shopping Mall,Thai Restaurant,Sushi Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Nightclub,Restaurant,Stadium,Pet Store,Performing Arts Venue
2,"Business reply mail Processing Centre, South C...",Skate Park,Pizza Place,Restaurant,Park,Pub
3,"CN Tower, King and Spadina, Railway Lands, Har...",Rental Car Location,Sculpture Garden,Yoga Studio,Salad Place,Ramen Restaurant
4,Central Bay Street,Sandwich Place,Salad Place,Yoga Studio,Thai Restaurant,Office
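
The suffix lookup used for the column names works for the five ranks shown here, but it would label a 21st column as '21th'. As a hedged sketch, the hypothetical helper `ordinal` below (my own name, not from the notebook) handles the general case, including the 11th to 13th exceptions.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # numbers ending in 10 to 20 (11th, 12th, 13th, 111th, ...) always take 'th'.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 5
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(rank))
    for rank in range(1, num_top_venues + 1)
]
print(columns)
```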


In [14]:
# define number of clusters.
kclusters = 4

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighborhood')

# run K-Means 50 times with different initial centroids and keep the run with the lowest inertia.
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=50, random_state=0).fit(Toronto_grouped_clustering)
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 1, 0,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0])
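
The `n_init` argument controls how many times scikit-learn restarts K-Means from fresh initial centroids; the fitted model is the run with the lowest inertia (within-cluster sum of squared distances). The sketch below shows this on made-up two-dimensional blobs rather than the venue data; the blob centres, sizes, and seeds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: three tight blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=centre, scale=0.1, size=(20, 2))
    for centre in (0.0, 1.0, 2.0)
])

# one initialization per seed, recording the final inertia of each run.
single_run_inertias = [
    KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(5)
]

# n_init=5 repeats the initialization internally and keeps the best run.
best = KMeans(n_clusters=3, init='k-means++', n_init=5, random_state=0).fit(X)

print(single_run_inertias)
print(best.inertia_)
print(np.bincount(best.labels_))
```

On data this well separated the runs usually agree; on sparse, high-dimensional frequency vectors like the venue data, individual runs can differ more, which is the reason for using a large `n_init`.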

In [15]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
Toronto_merged = Toronto_df

Toronto_merged = Toronto_merged.merge(neighborhoods_venues_sorted)

Toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Pub,Park,Theater,Restaurant,Yoga Studio
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Yoga Studio,Sandwich Place,Theater,Park,Sushi Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Tea Room,Ramen Restaurant,Plaza,Pizza Place,Theater
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Restaurant,Seafood Restaurant,Theater,Salon / Barbershop,Poke Place
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Pub,Trail,Yoga Studio,Sake Bar,Ramen Restaurant
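
Calling `merge` without arguments performs an inner join on the column names the two dataframes share, so a neighborhood for which no venues were returned would silently drop out of `Toronto_merged`. The sketch below, on made-up frames, shows one way to make the join explicit and to list unmatched rows; `how`, `validate`, and `indicator` are standard `DataFrame.merge` parameters, while the data is invented.

```python
import pandas as pd

# made-up stand-ins for Toronto_df and neighborhoods_venues_sorted.
left = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                     'Borough': ['X', 'X', 'Y']})
right = pd.DataFrame({'Neighborhood': ['A', 'B'],
                      'Cluster Labels': [0, 1]})

# keep every row of the left frame and record where each row came from.
merged = left.merge(
    right,
    on='Neighborhood',
    how='left',
    validate='one_to_one',  # raises if a neighborhood name is duplicated
    indicator=True,         # adds a '_merge' column
)

# neighborhoods that received no cluster label.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

Rows listed in `unmatched` have a missing cluster label and would need to be dropped or handled before they are plotted.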


Now that K-Means has been applied to the data, I will create a separate map displaying the clusters and then explore the clusters individually.

In [16]:
# create map of Toronto.
Toronto_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# assign one evenly spaced rainbow color to each cluster.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add clusters to map.
for lat, lon, poi, cluster in zip(Toronto_merged.Latitude, Toronto_merged.Longitude, Toronto_merged.Neighborhood, Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(Toronto_clusters)
       
Toronto_clusters

In [17]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 0, :] 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Pub,Park,Theater,Restaurant,Yoga Studio
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Yoga Studio,Sandwich Place,Theater,Park,Sushi Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Tea Room,Ramen Restaurant,Plaza,Pizza Place,Theater
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Restaurant,Seafood Restaurant,Theater,Salon / Barbershop,Poke Place
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Seafood Restaurant,Restaurant,Shopping Mall,Thai Restaurant,Sushi Restaurant
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Sandwich Place,Salad Place,Yoga Studio,Thai Restaurant,Office
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Park,Nightclub,Restaurant,Yoga Studio,Record Shop
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,0,Restaurant,Steakhouse,Thai Restaurant,Sushi Restaurant,Pizza Place
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,0,Pharmacy,Park,Supermarket,Pizza Place,Yoga Studio
10,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,0,Scenic Lookout,Restaurant,Park,Pizza Place,Plaza


In [18]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 1, :] 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
29,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,1,Playground,Tennis Court,Restaurant,Yoga Studio,Poutine Place


In [19]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, :] 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Swim School,Park,Poutine Place,Pub,Ramen Restaurant
33,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Playground,Trail,Yoga Studio,Sake Bar


In [20]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 3, :] 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Pub,Trail,Yoga Studio,Sake Bar,Ramen Restaurant
21,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307,3,Sushi Restaurant,Trail,Sake Bar,Pub,Ramen Restaurant
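
As a quick summary of how the neighborhoods are distributed across the clusters, the labels printed by `kmeans.labels_` earlier can be counted directly. This is a small sketch using only the standard library; the list is copied from that output.

```python
from collections import Counter

# labels copied from the kmeans.labels_ output shown earlier.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 1, 0,
          0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0]

# number of neighborhoods per cluster, ordered by cluster label.
cluster_sizes = dict(sorted(Counter(labels).items()))
print(cluster_sizes)  # {0: 34, 1: 1, 2: 2, 3: 2}
```

With 34 of the 39 neighborhoods in cluster 0, the remaining clusters act more like outlier groups, which matches the small tables for clusters 1 to 3 above.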
