<h1 align=center><font size = 6>Segmenting and Clustering Neighborhoods in Toronto--part 3</font></h1>

<h1 align=center><font size = 5>Su Yiping</font></h1>

**Explore and cluster the neighborhoods in Toronto** 
1. Generate maps to visualize your neighborhoods and how they cluster together
2. Decide to work with only boroughs that contain the word Toronto and then replicate the same analysis
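
The borough filter mentioned in step 2 could be sketched as follows. This is a minimal illustration, assuming the `Borough` column produced in part 2; the three rows are made up for the example and the name `toronto_only` is hypothetical.

```python
import pandas as pd

# Illustrative rows only; the real data is loaded from toronto_2.csv later on.
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M4E", "M5A"],
    "Borough": ["Scarborough", "East Toronto", "Downtown Toronto"],
    "Neighborhood": ["Malvern, Rouge", "The Beaches", "Harbourfront"],
})

# Keep only the boroughs whose name contains the word "Toronto".
toronto_only = sample[sample["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(toronto_only)
```

The same analysis can then be rerun on the filtered dataframe in place of the full one.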

In [1]:
#install the folium package
!pip install folium



In [2]:
# import the libraries
import pandas as pd
import folium   # map rendering library
from geopy.geocoders import Nominatim  # OSM(OpenStreetMap) data
import json # library to handle JSON files
from sklearn.cluster import KMeans
import requests
from tqdm import tqdm
from collections import deque
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

# Loading data

Load the data generated in part 2 and verify the number of boroughs and neighborhoods in the data

In [3]:
#load the dataframe from the csv file
df = pd.read_csv('toronto_2.csv')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944
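
One possible way to check the number of boroughs and neighborhoods is sketched below. It assumes the `Borough` and `Neighborhood` columns shown above; the small dataframe is illustrative and stands in for `df`.

```python
import pandas as pd

# Illustrative stand-in for df; the real dataframe comes from toronto_2.csv.
sample = pd.DataFrame({
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighborhood": ["Woburn", "Cedarbrae", "Parkwoods"],
})

n_boroughs = sample["Borough"].nunique()  # distinct borough names
n_rows = sample.shape[0]                  # one row per postal code area
print("The dataframe has {} boroughs and {} neighborhoods.".format(n_boroughs, n_rows))
```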


Using geolocator to get the geographical coordinates of Toronto

In [4]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent='toronto_explorer')  # placeholder application name; recent geopy versions require one
location = geolocator.geocode(address)  # get the geographical coordinates of Toronto
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))


The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto using the latitude and longitude values

In [5]:
# center the map on the Toronto coordinates returned by the geocoder
map_toronto = folium.Map(location = [toronto_latitude, toronto_longitude], zoom_start = 10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  

map_toronto

In [7]:
#Configure Foursquare access
CLIENT_ID = 'TCJK5SCFXOAOICUOCPMYEQ1TMHFQ3QUTLJPHGXDYQHJFMDMX' # your Foursquare ID
CLIENT_SECRET = 'QUW4QNIGHKPXFQ5U3QC1LLYNNUCGLVRPCSID3IWYTFR25X40' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: TCJK5SCFXOAOICUOCPMYEQ1TMHFQ3QUTLJPHGXDYQHJFMDMX
CLIENT_SECRET:QUW4QNIGHKPXFQ5U3QC1LLYNNUCGLVRPCSID3IWYTFR25X40


**Define a function that takes the names and coordinates of the neighborhoods in Toronto and returns the venues near each one**

In [8]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in tqdm(zip(names, latitudes, longitudes), total = names.size):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
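
The function above indexes straight into the JSON response, so a response with no groups or a venue with no categories would raise an error. A more tolerant parsing step might look like the sketch below. The helper name `extract_venues` and the `"Unknown"` fallback are assumptions for illustration, and the payload is a hand-written stand-in that mimics only the fields the function above reads, not a real API response.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore-style response into rows, tolerating missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hand-written payload for illustration only.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A",
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Mystery Spot",
               "location": {"lat": 43.1, "lng": -79.1},
               "categories": []}},
]}]}}

rows = extract_venues("Demo", 43.0, -79.0, fake_payload)
empty_rows = extract_venues("Demo", 43.0, -79.0, {})
print(rows)
```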

In [9]:
# fetch up to 100 venues within 500 m of each neighborhood
Toronto_venues = getNearbyVenues(df.Neighborhood,df.Latitude, df.Longitude)

100%|██████████| 103/103 [00:27<00:00,  4.70it/s]


In [11]:
# the shape of my data 
Toronto_venues.shape

(2468, 7)

In [12]:
# a sample of the initial rows of information
Toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.811525,-79.195517,Canadian Appliance Source Whitby,43.808353,-79.191331,Home Service
1,"Highland Creek, Port Union, Rouge Hill",43.785665,-79.158725,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
2,"Highland Creek, Port Union, Rouge Hill",43.785665,-79.158725,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,"Highland Creek, Port Union, Rouge Hill",43.785665,-79.158725,Royal Canadian Legion,43.782533,-79.163085,Bar
4,"Guildwood, Morningside, West Hill",43.765815,-79.175193,Homestead Roofing Repair,43.76514,-79.178663,Construction & Landscaping


In [13]:
# check the number of venues per neighborhood
Toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,11,11,11,11,11,11
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",1,1,1,1,1,1
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",17,17,17,17,17,17
"Alderwood, Long Branch",4,4,4,4,4,4
"Bathurst Manor, Downsview North, Wilson Heights",1,1,1,1,1,1
"Bathurst Quay, CN Tower, Harbourfront West, Island airport, King and Spadina, Railway Lands, South Niagara",68,68,68,68,68,68
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Berczy Park,61,61,61,61,61,61


**Find out how many unique categories can be curated from all the returned venues**

In [14]:
# the total amount of unique categories in my data
len(Toronto_venues['Venue Category'].unique())

263

### Analyze Each Neighborhood

In [15]:
# take the venue category information and create a dataframe with a one-hot encoding of these data
Toronto_onehot = pd.get_dummies(Toronto_venues["Venue Category"],
                             prefix = "",
                             prefix_sep = "")

Toronto_onehot["Neighborhood"] = Toronto_venues["Neighborhood"]


# move the "Neighborhood" column to the front by rotating the column order
nindex = list(Toronto_onehot.columns).index("Neighborhood")
cols = deque(Toronto_onehot.columns)
cols.rotate(-nindex)
cols = list(cols)
Toronto_onehot = Toronto_onehot[cols]

Toronto_onehot.head()

Unnamed: 0,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,...,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music Store,Music Venue
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Highland Creek, Port Union, Rouge Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,"Highland Creek, Port Union, Rouge Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Highland Creek, Port Union, Rouge Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
Toronto_onehot.shape

(2468, 263)

### Group rows by neighborhood, taking the mean frequency of occurrence of each category

In [17]:
# compute the mean frequency of each venue category per neighborhood
toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,...,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music Store,Music Venue
0,"Adelaide, King, Richmond",0.010000,0.000000,0.010000,0.010000,0.01,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.01,0.000000,0.0,0.000000,0.000000,0.000000
1,Agincourt,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
4,"Alderwood, Long Branch",0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
5,"Bathurst Manor, Downsview North, Wilson Heights",0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
6,"Bathurst Quay, CN Tower, Harbourfront West, Is...",0.014706,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
7,Bayview Village,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
8,"Bedford Park, Lawrence Manor East",0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.0,0.000000,0.000000,0.000000
9,Berczy Park,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.0,0.00000,...,0.000000,0.000000,0.000000,0.016393,0.00,0.000000,0.0,0.016393,0.000000,0.000000


In [18]:
# size of dataframe 
toronto_grouped.shape

(100, 263)
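
To make the one-hot encoding and the grouped mean concrete, here is a small sketch on made-up data. The neighborhoods `A` and `B` and their venues are illustrative only.

```python
import pandas as pd

# Six made-up venues in two made-up neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bar", "Park", "Park"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Here neighborhood `A` ends up with 0.5 for `Café` because two of its four venues are cafés. Using `insert(0, ...)` places the name column first directly, and it raises a `ValueError` if a venue category is itself named `Neighborhood` rather than silently overwriting that column.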

**Define a function that returns the N most frequent venue categories**

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
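
As a quick check of how the function behaves, the sketch below applies it to a single made-up row. The function is repeated so the block runs on its own; the category frequencies are illustrative.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading "Neighborhood" entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One made-up grouped row: the neighborhood name followed by category frequencies.
row = pd.Series({"Neighborhood": "A", "Bar": 0.25, "Café": 0.5, "Park": 0.1, "Gym": 0.15})
top = return_most_common_venues(row, 2)
print(list(top))
```

Categories with equal frequency, including zero, come out in whatever order the sort leaves them. This is why neighborhoods with only a handful of venues still list ten categories in the table that follows.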

In [20]:
#  Create the new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Restaurant,Hotel,Gym,Breakfast Spot,Japanese Restaurant,Steakhouse,Gastropub,Asian Restaurant
1,Agincourt,Shopping Mall,Sushi Restaurant,Vietnamese Restaurant,Bubble Tea Shop,Supermarket,Grocery Store,Pool,Shanghai Restaurant,Chinese Restaurant,Bakery
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Pharmacy,Music Venue,Art Gallery,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arts & Crafts Store,Wings Joint
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Liquor Store,Fast Food Restaurant,Hardware Store,Beer Store,Auto Garage,Gym Pool,Pizza Place,Video Store,Pharmacy
4,"Alderwood, Long Branch",Convenience Store,Gym,Pub,Performing Arts Venue,Art Gallery,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop


# Neighborhood clustering

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [21]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])
print(kmeans.labels_.shape)

[0 0 2 0 0 4 0 0 0 0]
(100,)
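
The value of 5 clusters is fixed by hand above. One common way to sanity-check such a choice is to compare the k-means inertia (within-cluster sum of squares) over a range of k. The sketch below does this on synthetic two-dimensional data, not on the Toronto features; the data, the range of k, and `n_init=10` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs, standing in for the real feature matrix.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(30, 2) for c in centers])

# Fit k-means for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The k after which the inertia stops dropping sharply is a reasonable candidate. On the real data, `toronto_grouped_clustering` would take the place of `X`.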


### Create a dataframe that contains the neighborhood, the location and the cluster information, together with the top 10 venues

In [22]:
# attach the cluster label to each grouped neighborhood
toronto_grouped["Cluster Labels"] = kmeans.labels_

# merge with df to add the postal code, borough and latitude/longitude of each neighborhood
Toronto_combined = df.merge(toronto_grouped, on="Neighborhood", how="outer")

# add the top 10 venue categories for each neighborhood
Toronto_combined = Toronto_combined.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# rows without a cluster label (neighborhoods for which no venues were returned) get the extra label 5
Toronto_combined["Cluster Labels"] = Toronto_combined["Cluster Labels"].fillna(5).astype("int")

Toronto_combined.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,New American Restaurant,Nightclub,Noodle House,Office,Opera House,...,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.811525,-79.195517,0.0,0.0,0.0,0.0,0.0,...,Home Service,Music Venue,Wings Joint,BBQ Joint,Auto Garage,Auto Dealership,Athletics & Sports,Asian Restaurant,Arts & Crafts Store,Art Gallery
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.785665,-79.158725,0.0,0.0,0.0,0.0,0.0,...,History Museum,Moving Target,Bar,Antique Shop,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193,0.0,0.0,0.0,0.0,0.0,...,Construction & Landscaping,Park,Gym / Fitness Center,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
3,M1G,Scarborough,Woburn,43.768369,-79.21759,0.0,0.0,0.0,0.0,0.0,...,Business Service,Park,Coffee Shop,Korean Restaurant,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944,0.0,0.0,0.0,0.0,0.0,...,Playground,Trail,Music Venue,Arts & Crafts Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery,Asian Restaurant


In [23]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

# one extra color for the placeholder label 5 (rows without a cluster);
# a separate variable keeps kclusters unchanged if the cell is re-run
n_colors = kclusters + 1

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, n_colors))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_combined['Latitude'],
                                  Toronto_combined['Longitude'],
                                  Toronto_combined['Neighborhood'],
                                  Toronto_combined['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Analyze clusters
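
Before looking at each cluster in turn, it can help to count how many neighborhoods carry each label. The sketch below shows the idea on a made-up dataframe that stands in for `Toronto_combined`; the labels are illustrative.

```python
import pandas as pd

# Made-up stand-in for Toronto_combined with only the columns needed here.
combined = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0, 0, 3, 0, 5],
})

# Number of neighborhoods per cluster label, ordered by label.
cluster_sizes = combined["Cluster Labels"].value_counts().sort_index()
print(cluster_sizes)
```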

### Cluster 1

In [24]:
Toronto_combined.loc[Toronto_combined['Cluster Labels'] == 0, "1st Most Common Venue":"10th Most Common Venue"].head()

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,History Museum,Moving Target,Bar,Antique Shop,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store
4,Playground,Trail,Music Venue,Arts & Crafts Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery,Asian Restaurant
5,Train Station,Restaurant,Grocery Store,Indian Restaurant,Arts & Crafts Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
6,Discount Store,Hobby Shop,Department Store,Coffee Shop,Convenience Store,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant
7,Bakery,Bus Line,Coffee Shop,Soccer Field,Metro Station,Bus Station,Intersection,Toy / Game Store,Yoga Studio,BBQ Joint


### Cluster 2

In [25]:
Toronto_combined.loc[Toronto_combined['Cluster Labels'] == 1, "1st Most Common Venue":"10th Most Common Venue"].head()

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Home Service,Music Venue,Wings Joint,BBQ Joint,Auto Garage,Auto Dealership,Athletics & Sports,Asian Restaurant,Arts & Crafts Store,Art Gallery
32,Business Service,Home Service,Antique Shop,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Music Venue,Wine Shop


### Cluster 3 

In [26]:
Toronto_combined.loc[Toronto_combined['Cluster Labels'] == 2, "1st Most Common Venue":"10th Most Common Venue"].head()

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Pharmacy,Music Venue,Art Gallery,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arts & Crafts Store,Wings Joint


### Cluster 4 

In [27]:
Toronto_combined.loc[Toronto_combined['Cluster Labels'] == 3, "1st Most Common Venue":"10th Most Common Venue"].head()

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Construction & Landscaping,Park,Gym / Fitness Center,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
3,Business Service,Park,Coffee Shop,Korean Restaurant,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
25,Food & Drink Shop,Park,Antique Shop,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Music Venue,Wings Joint
34,Park,Food Stand,Grocery Store,Music Venue,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery
64,Park,Music Venue,Art Gallery,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arts & Crafts Store,Wings Joint


### Cluster 5 

In [28]:
Toronto_combined.loc[Toronto_combined['Cluster Labels'] == 4, "1st Most Common Venue":"10th Most Common Venue"].head()

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,Men's Store,Music Venue,Antique Shop,Women's Store,Yoga Studio,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Wine Shop
