# Clustering neighborhoods in Toronto

1. [Get neighborhoods in Toronto](#section_1)
2. [Explore venues among neighborhoods](#section_2)
3. [Cluster neighborhoods based on categories distribution of their venues](#section_3)
4. [Finer resolution clustering within Toronto downtown neighborhoods](#section_4)

<a id='section_1'></a>
## 1. Get neighborhoods in Toronto
### 1.1 Scrape postal codes:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
import requests
from bs4 import BeautifulSoup

# use BeautifulSoup to scrape the webpage
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
# print(soup.prettify())

In [3]:
# extract the neighborhood table from the page
data = []
table = soup.find('table', attrs={'class':'wikitable sortable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if cols:
        cols = [ele.text.strip() for ele in cols]
        data.append(cols)  # keep the stripped cell text of each data row

In [4]:
# convert the table data into a dataframe
import pandas as pd
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [5]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
# df['Borough'].unique() #display unique Borough to make sure no other variation of "Not assigned"
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [5]:
# the filtered dataframe has 103 rows (neighborhoods)
df.shape

(103, 3)

### 1.2 Find out the latitude and the longitude coordinates of each neighborhood

In [84]:
# !conda install -c conda-forge geopy --yes 
# !pip install geocoder

In [1]:
# import geocoder

# # initialize your variable to None
# lat_lng_coords = None

# # loop until you get the coordinates
# postal_code = df['PostalCode'][3]
# while(lat_lng_coords is None):
#   g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#   lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]

In [6]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

geolocator = Nominatim(user_agent="ca_explorer")
# # address = 'New York City, NY'

# latitude = np.empty(df.shape[0])
# longitude = np.empty(df.shape[0])
# latitude[:] = np.nan
# longitude[:] = np.nan

# for ind in range(df.shape[0]):
#     address = '{}, {}, Toronto, Ontario'.format(df['Neighborhood'][ind].split(',')[0],df['Borough'][ind])
#     postal_code = df['PostalCode'][ind]

#     location = geolocator.geocode(address)
#     if location:
#         latitude[ind] = location.latitude
#         longitude[ind] = location.longitude
#     print('The geographical coordinates of {} ({}) are {}, {}.'.format(address, postal_code, latitude[ind], longitude[ind]))
    
# np.sum(np.isnan(longitude))    

In [6]:
# add new columns to df
# df['Latitude'] = latitude
# df['Longitude'] = longitude
# df.head()

The previous cells tried to convert each neighborhood address into coordinates with `geolocator.geocode`. For about 20% of the addresses, however, no valid coordinates were returned, so the coordinate table provided by the course is used directly instead.

In [7]:
# alternatively, load coordinates data directly...
postal_data = pd.read_csv('http://cocl.us/Geospatial_data') 
postal_data.head()
# rename the header columns to match the neighborhood dataframe
postal_data.columns = ['PostalCode','Latitude','Longitude']

In [8]:
# merge the two dataframes, so coordinates are attached to each neighborhood
df_to = pd.merge(df, postal_data, how='right', on='PostalCode')
df_to.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
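
One way to confirm that every postal code actually received coordinates is an outer merge with `indicator=True`. The sketch below uses toy stand-ins for `df` and `postal_data` (the frame names `neighborhoods` and `coordinates`, and the `M9Z` row, are made up for illustration) and lists the codes that appear on only one side.

```python
import pandas as pd

# Toy stand-ins for `df` and `postal_data`; the M9Z row is hypothetical.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})
coordinates = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Latitude": [43.753259, 43.725882, 43.6],
    "Longitude": [-79.329656, -79.315572, -79.5],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
checked = pd.merge(neighborhoods, coordinates, how="outer",
                   on="PostalCode", indicator=True)

# Rows that did not match on both sides need attention before mapping.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame would mean the choice between `how='left'` and `how='right'` makes no difference to which rows survive the merge.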


In [21]:
# Visualize neighborhoods on a map with the coordinates

# !conda install -c conda-forge folium=0.5.0 --yes
import folium
import numpy as np

address = 'Toronto, Ontario'
location = geolocator.geocode(address)
latitude_to = location.latitude
longitude_to = location.longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude_to, longitude_to], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_to['Latitude'], df_to['Longitude'], df_to['Borough'], df_to['Neighborhood']):
    if not np.isnan(lat):
        label = '{}; {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)  
    else:
        print('Missing: {}; {}.'.format(neighborhood,borough))
    
map_toronto

<a id='section_2'></a>
## 2. Explore venues among neighborhoods

In [44]:
# function to request venues in each neighborhood, identified by [latitudes, longitudes]
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' 
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
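
The parsing step inside `getNearbyVenues` indexes straight into `groups[0]` and `categories[0]`. The sketch below separates that step into a hypothetical helper, `extract_venue_rows`, and runs it on a canned payload with the same nesting, so it needs no network access. Whether the real API ever returns an empty `groups` or `categories` list is an assumption here; the helper simply tolerates both cases.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into tuples (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Canned payload; the second venue is made up to exercise the fallback.
sample = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Brookbanks Park",
                   "location": {"lat": 43.751976, "lng": -79.33214},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Uncategorized place",
                   "location": {"lat": 43.75, "lng": -79.33},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(sample, "Parkwoods", 43.753259, -79.329656)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or relabelled before one-hot encoding.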

In [11]:
# request the venues across the neighborhoods
toronto_venues = getNearbyVenues(names=df_to['Neighborhood'],
                                 latitudes=df_to['Latitude'],
                                 longitudes=df_to['Longitude'])
# each row contains one venue returned
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [12]:
# list how many venues were returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()['Venue']

Neighborhood
Agincourt                                           5
Alderwood, Long Branch                              8
Bathurst Manor, Wilson Heights, Downsview North    20
Bayview Village                                     4
Bedford Park, Lawrence Manor East                  23
                                                   ..
Willowdale, Willowdale West                         6
Woburn                                              3
Woodbine Heights                                    6
York Mills West                                     2
York Mills, Silver Hills                            1
Name: Venue, Length: 95, dtype: int64

In [13]:
# Venue category representation for each neighborhood

# one hot encoding for venue category
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head() # each row coded for one venue in that neighborhood

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
# fraction of venues in each category per neighborhood (mean of the one-hot rows); each row should sum to 1
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
# np.sum(toronto_grouped.iloc[0, 1:-1].to_numpy())

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
91,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
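
To make the encoding concrete, here is a minimal sketch on a toy venue table (neighborhoods `A` and `B` are made up). It one-hot encodes the category, averages per neighborhood, and checks that each row of category frequencies sums to 1. Passing `dtype=float` to `get_dummies` is a choice made for this sketch so that the mean is taken over numeric columns regardless of pandas version.

```python
import pandas as pd

# Toy venue list: three venues in A, one in B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Café", "Bank"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="",
                        dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the 0/1 columns is the share of venues in each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
row_sums = grouped.drop(columns="Neighborhood").sum(axis=1)

print(grouped)
print(row_sums.tolist())
```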


In [63]:
# incorporate the total number of venues for each neighborhood
# this can be an important feature, since some neighborhoods have more venues overall than others
toronto_venue_total = toronto_venues.groupby('Neighborhood').count()['Venue'].reset_index()
toronto_venue_total.columns = ['Neighborhood','TotalVenue']
toronto_venue_total['TotalVenue'] = toronto_venue_total['TotalVenue']/LIMIT/200
toronto_grouped = toronto_grouped.join(toronto_venue_total.set_index('Neighborhood'), on='Neighborhood')
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,TotalVenue
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0005
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0008
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0020
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0004
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0006
91,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0003
92,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0006
93,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0002
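
The expression `count / LIMIT / 200` fixes how much weight the venue count carries next to the category frequencies. The sketch below applies the same arithmetic to made-up counts; the name `SCALE` for the divisor 200 is introduced here for readability and does not appear in the notebook.

```python
import pandas as pd

LIMIT = 100   # per-request venue cap, as in the notebook
SCALE = 200   # extra divisor from the notebook; the name is hypothetical

# Made-up venue counts for three illustrative neighborhoods.
counts = pd.Series({"sparse": 4, "medium": 40, "at the cap": LIMIT})

total_venue = counts / LIMIT / SCALE
print(total_venue)
```

Under this scaling a neighborhood that reaches the cap gets 0.005, so the feature stays small compared with category frequencies, which lie between 0 and 1. Changing the divisor changes how strongly overall venue density influences the clusters.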


In [16]:
# display top 5 venues for each neighborhood
# num_top_venues = 5
# for hood in toronto_grouped['Neighborhood']:
#     print("----"+hood+"----")
#     temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
#     temp.columns = ['venue','freq']
#     temp = temp.iloc[1:-2]
#     temp['freq'] = temp['freq'].astype(float)
#     temp = temp.round({'freq': 2})
#     print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
#     print('\n')

In [65]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:-1]  # drop 'Neighborhood' (first) and 'TotalVenue' (last)
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
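
The slice bounds matter because each grouped row starts with `Neighborhood` and ends with `TotalVenue`. A toy row (values made up) shows that `iloc[1:-1]` keeps every category column, while `iloc[1:-2]` would also drop the last category.

```python
import pandas as pd

# Toy grouped row: label first, category frequencies, TotalVenue last.
row = pd.Series({
    "Neighborhood": "Toy",
    "Bank": 0.1,
    "Park": 0.5,
    "Women's Store": 0.4,
    "TotalVenue": 0.0002,
})

categories = row.iloc[1:-1].astype(float)   # all three categories
top2 = categories.sort_values(ascending=False).index.values[:2]
print(list(top2))

too_short = row.iloc[1:-2]                  # loses the last category
print(list(too_short.index))
```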

In [64]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Skating Rink,Clothing Store,Latin American Restaurant,Lounge,Breakfast Spot
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Sandwich Place,Pub,Pool
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pharmacy,Shopping Mall,Sandwich Place
3,Bayview Village,Café,Japanese Restaurant,Chinese Restaurant,Bank,Wings Joint
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Restaurant,Coffee Shop,Pizza Place


<a id='section_3'></a>
## 3. Cluster neighborhoods based on categories distribution of their venues

In [81]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

# toronto_grouped_clustering = toronto_grouped.drop(['Neighborhood', 'TotalVenue'], 1)
toronto_grouped_clustering = toronto_grouped.drop(columns=['Neighborhood'])

In [82]:
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# # check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:10] 


# add clustering labels
# neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_

# join the cluster labels and venue totals onto df_to, which holds latitude/longitude for each neighborhood
toronto_merged = df_to
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged = toronto_merged.join(toronto_venue_total.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels,TotalVenue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park,Construction & Landscaping,Food & Drink Shop,Cuban Restaurant,Doner Restaurant,6.0,0.0003
1,M4A,North York,Victoria Village,43.725882,-79.315572,Coffee Shop,Pizza Place,Portuguese Restaurant,Intersection,Hockey Arena,8.0,0.0005
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,8.0,0.0044
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Furniture / Home Store,Accessories Store,Coffee Shop,Vietnamese Restaurant,8.0,0.0015
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Sandwich Place,8.0,0.0034
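
The labels print as `6.0` and `8.0` rather than integers. A likely reason is that the join leaves `NaN` for any row without a matching entry in `neighborhoods_venues_sorted`, which forces the column to float. The toy frames below (names `base` and `labels` are hypothetical) reproduce the effect and show one way to list the unmatched rows.

```python
import pandas as pd

# Toy stand-ins: B has no cluster label, e.g. because no venues were returned.
base = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [6, 8]})

merged = base.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].dtype)

# Rows that received no label.
missing = merged[merged["Cluster Labels"].isna()]
print(missing["Neighborhood"].tolist())

# Optional: pandas' nullable integer dtype keeps integers alongside <NA>.
nullable = merged["Cluster Labels"].astype("Int64")
print(nullable.tolist())
```

This is also why the map code checks `np.isnan(cluster)` before indexing into the color list.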


In [83]:
# visualize the clusters
# note that this clustering also takes overall popularity into account
# (measured by the number of venues returned by the Foursquare API) for each neighborhood

import matplotlib.cm as cm
import matplotlib.colors as colors


# create map
map_clusters = folium.Map(location=[latitude_to, longitude_to], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#000000' if np.isnan(cluster) else rainbow[int(cluster)-1],
        fill=True,
        fill_color='#000000' if np.isnan(cluster) else rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
# downtown Toronto neighborhoods are well clustered

In [85]:
# inspect each cluster by glancing over most common venues 
icateg = 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == icateg, toronto_merged.columns[[2] + \
                   list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels,TotalVenue
83,"Moore Park, Summerhill East",Playground,Park,Restaurant,Dog Run,Distribution Center,1.0,0.0003
85,"Milliken, Agincourt North, Steeles East, L'Amo...",Playground,Park,Dim Sum Restaurant,Cupcake Shop,Curling Ice,1.0,0.0002
91,Rosedale,Park,Trail,Playground,Cupcake Shop,Curling Ice,1.0,0.0004


<a id='section_4'></a>
## 4. Finer resolution clustering within Toronto downtown neighborhoods

In [113]:
# filter df_to down to the neighborhoods whose borough name contains "Toronto"
df_to['Borough'].unique() # a total of 10 boroughs in Toronto, 
                          # boroughs of interest: 'Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto'
df_toronto_downtown = df_to[(df_to['Borough']=='Downtown Toronto') | (df_to['Borough']=='East Toronto') | \
               (df_to['Borough']=='West Toronto') | (df_to['Borough']=='Central Toronto')].reset_index(drop=True)
# df_toronto_downtown = df_to[(df_to['Borough']=='Downtown Toronto')].reset_index(drop=True)
df_toronto_downtown.shape

(39, 5)

In [114]:
# # Visualize Toronto downtown areas using latitude and longitude values
# map_toronto = folium.Map(location=[latitude_to, longitude_to], zoom_start=12)

# # add markers to map
# for lat, lng, borough, neighborhood in zip(df_toronto_downtown['Latitude'], df_toronto_downtown['Longitude'], 
#                                            df_toronto_downtown['Borough'], df_toronto_downtown['Neighborhood']):
#     if ~np.isnan(lat):
#         label = '{}; {}'.format(neighborhood, borough)
#         label = folium.Popup(label, parse_html=True)
#         folium.CircleMarker(
#             [lat, lng],
#             radius=5,
#             popup=label,
#             color='blue',
#             fill=True,
#             fill_color='#3186cc',
#             fill_opacity=0.7,
#             parse_html=False).add_to(map_toronto)  
#     else:
#         print('Missing: {}; {}.'.format(neighborhood,borough))
    
# map_toronto

In [61]:
# get a sense of distances among neighborhoods
import geopy.distance

neighbor1 = 5
neighbor2 = 34
print(df_toronto_downtown['Neighborhood'][neighbor1])
print(df_toronto_downtown['Neighborhood'][neighbor2])
coords_1 = (df_toronto_downtown['Latitude'][neighbor1], df_toronto_downtown['Longitude'][neighbor1])
coords_2 = (df_toronto_downtown['Latitude'][neighbor2], df_toronto_downtown['Longitude'][neighbor2])

geopy.distance.geodesic(coords_1, coords_2).km  # vincenty() was removed in geopy 2.0

Berczy Park
Stn A PO Boxes

0.2227709903987258
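
For a dependency-free cross-check of such distances, the haversine formula gives the great-circle distance on a sphere. It is a spherical approximation, so it will differ slightly from an ellipsoidal result. The sketch uses two coordinate pairs that appear in the tables above (Regent Park, Harbourfront and St. James Town); the function name `haversine_km` and the mean Earth radius constant are choices made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(coord_a, coord_b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, coord_a)
    lat2, lon2 = map(radians, coord_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


regent_park = (43.65426, -79.360636)
st_james_town = (43.651494, -79.375418)

distance_km = haversine_km(regent_park, st_james_town)
print(round(distance_km, 2))
```

Comparing such distances with the 500 m search radius used in `getNearbyVenues` indicates how much the venue sets of adjacent neighborhoods may overlap.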

In [115]:
# request the venues across the neighborhoods
toronto_dt_venues = getNearbyVenues(names=df_toronto_downtown['Neighborhood'],
                                 latitudes=df_toronto_downtown['Latitude'],
                                 longitudes=df_toronto_downtown['Longitude'])
# each row contains one venue returned
toronto_dt_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


In [116]:
# venue representations based on returned venue category
# one hot encoding for venue category
toronto_dt_onehot = pd.get_dummies(toronto_dt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_dt_onehot['Neighborhood'] = toronto_dt_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_dt_onehot.columns[-1]] + list(toronto_dt_onehot.columns[:-1])
toronto_dt_onehot = toronto_dt_onehot[fixed_columns]

toronto_dt_onehot.head() # each row coded for one venue in that neighborhood

# fraction of venues in each category per neighborhood (mean of the one-hot rows); each row should sum to 1
toronto_dt_grouped = toronto_dt_onehot.groupby('Neighborhood').mean().reset_index()

# factor in the feature of overall popularity
toronto_venue_total = toronto_dt_venues.groupby('Neighborhood').count()['Venue'].reset_index()
toronto_venue_total.columns = ['Neighborhood','TotalVenue']
toronto_venue_total['TotalVenue'] = toronto_venue_total['TotalVenue']/LIMIT/200
toronto_dt_grouped = toronto_dt_grouped.join(toronto_venue_total.set_index('Neighborhood'), on='Neighborhood')

toronto_dt_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,TotalVenue
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0028
1,"Brockton, Parkdale Village, Exhibition Place",0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0012
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0008
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0008
4,Central Bay Street,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0,0.0032


In [117]:
# dataframe with most common venues for summary dataframe
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_dt_grouped['Neighborhood']

for ind in np.arange(toronto_dt_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_dt_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Cheese Shop,Bakery,Beer Bar
1,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Coffee Shop,Breakfast Spot,Yoga Studio
2,"Business reply mail Processing Centre, South C...",Skate Park,Park,Fast Food Restaurant,Farmers Market,Burrito Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Sculpture Garden,Rental Car Location
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Japanese Restaurant


In [122]:
# set number of clusters
kclusters = 5

# toronto_grouped_clustering = toronto_grouped.drop(['Neighborhood', 'TotalVenue'], 1)
toronto_dt_grouped_clustering = toronto_dt_grouped.drop(columns=['Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_dt_grouped_clustering)

# # check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:10] 


# add clustering labels
# neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_

# join the cluster labels and venue totals onto df_toronto_downtown, which holds latitude/longitude for each neighborhood
toronto_dt_merged = df_toronto_downtown
toronto_dt_merged = toronto_dt_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_dt_merged = toronto_dt_merged.join(toronto_venue_total.set_index('Neighborhood'), on='Neighborhood')

toronto_dt_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels,TotalVenue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,2,0.0022
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Distribution Center,2,0.0017
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Japanese Restaurant,2,0.005
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,Coffee Shop,Café,Restaurant,Gastropub,Cocktail Bar,2,0.0039
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,Asian Restaurant,Pub,Trail,Health Food Store,Dim Sum Restaurant,0,0.00025


In [123]:
# visualize the clusters
# note that this clustering also takes overall popularity into account
# (measured by the number of venues returned by the Foursquare API) for each neighborhood

import matplotlib.cm as cm
import matplotlib.colors as colors


# create map
map_clusters = folium.Map(location=[latitude_to, longitude_to], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_dt_merged['Latitude'], toronto_dt_merged['Longitude'], 
                                  toronto_dt_merged['Neighborhood'], toronto_dt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#000000' if np.isnan(cluster) else rainbow[int(cluster)-1],
        fill=True,
        fill_color='#000000' if np.isnan(cluster) else rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
# downtown Toronto neighborhoods are well clustered

In [112]:
# inspect each cluster by glancing over most common venues 
icateg = 2
toronto_dt_merged.loc[toronto_dt_merged['Cluster Labels'] == icateg, toronto_dt_merged.columns[[2] + \
                   list(range(5, toronto_dt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels,TotalVenue
0,"Regent Park, Harbourfront",Coffee Shop,Pub,Park,Bakery,Theater,2,0.0022
2,"Garden District, Ryerson",Clothing Store,Coffee Shop,Bubble Tea Shop,Japanese Restaurant,Italian Restaurant,2,0.005
3,St. James Town,Coffee Shop,Café,American Restaurant,Restaurant,Gastropub,2,0.0039
4,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Beer Bar,Seafood Restaurant,2,0.0028
7,"Richmond, Adelaide, King",Coffee Shop,Café,Restaurant,Gym,Hotel,2,0.0046
8,"Harbourfront East, Union Station, Toronto Islands",Coffee Shop,Aquarium,Café,Hotel,Fried Chicken Joint,2,0.005
9,"Toronto Dominion Centre, Design Exchange",Coffee Shop,Hotel,Café,Restaurant,American Restaurant,2,0.005
10,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,American Restaurant,2,0.005
11,"University of Toronto, Harbord",Café,Bar,Italian Restaurant,Japanese Restaurant,Restaurant,2,0.0017
12,"Kensington Market, Chinatown, Grange Park",Café,Vegetarian / Vegan Restaurant,Coffee Shop,Bakery,Mexican Restaurant,2,0.0031
