# <font color='blue'><center>Exploring the Toronto Neighborhoods</center></font>

## <font color='green'><center>Part 3 - Segmenting & Clustering</center></font>

In [93]:
# Install dependencies
#!conda install -c conda-forge geopy --yes 
#!conda install -c conda-forge folium=0.5.0 --yes

#These libraries are already installed. Hence, making this a markdown cell.

In [94]:
# Import libraries
import folium
import requests 
import json 
import matplotlib.cm as cm
import matplotlib.colors as colors
import pandas as pd
import numpy as np

from pandas.io.json import json_normalize 
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

<b>First, let us load the data from the CSV file genrated in the previous step.</b>

In [95]:
# Define the path of the input CSV file
dataFilePath = 'TorontoNeighborhoods.csv'

In [96]:
# Load the CSV file into a pandas dataframe
torontoNeighborhoods = pd.read_csv(dataFilePath)
torontoNeighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


<b>Now, we filter out all the Boroughs with the word <font color='blue'><i>Toronto</i></font> in it.</b>

In [97]:
toronto_data = torontoNeighborhoods[torontoNeighborhoods['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


<b>Let us get the latitude and longitude values of Toronto city using the geopy library.</b>

In [98]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Geograpical coordinates of Toronto: {}, {}.'.format(latitude, longitude))

Geograpical coordinates of Toronto: 43.6534817, -79.3839347.


<b>Now let us create a map of Toronto and superimpose the neighborhoods in our dataset on the same.</b>

In [99]:
# Create map of Toronto using the latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], 
                                           toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<b>Now that we have our neighborhoods shortlisted, let us use the Foursquare API to fetch and analyze the venues near to our neighborhoods.</b>

In [100]:
# Define the Foursquare API credentials
CLIENT_ID = 'ME1O3CMRJSLUPNPVD21343UONQ1YUXKTIENBET0CIC5Z224H'
CLIENT_SECRET = 'AX11LM10OEQQ54AH1WXGLW3GUHNKQ2LTHTELTO5KJUV5IBC3'
VERSION = '20180605' # Foursquare API version

<b>Let us now create a dataset containing the top 100 venues within a radius of 500 metres of each neighborhood.</b>

In [101]:
# Define the number of venues needed and the radius
LIMIT = 100
RADIUS = 500
URL_SKELETON = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'

# Define a function to return the top 100 venues within a radius of 500m
def getNearbyVenues(postalCodes, boroughs, names, latitudes, longitudes):
    venues_list=[]
    for post, borough, name, lat, lng in zip(postalCodes, boroughs, names, latitudes, longitudes):
            
        # Create the API request URL
        url = URL_SKELETON.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            RADIUS, 
            LIMIT)
            
        # GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # Retrieve only the relevant information for each venue
        venues_list.append([(
            post, 
            borough,
            name,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode',
                             'Borough',
                             'Neighborhood', 
                             'Neighborhood_Latitude', 
                             'Neighborhood_Longitude', 
                             'Venue', 
                             'Venue_Latitude', 
                             'Venue_Longitude', 
                             'Venue_Category']

    return(nearby_venues)

In [102]:
venues_df = getNearbyVenues(toronto_data['PostalCode'],
                            toronto_data['Borough'],
                            toronto_data['Neighborhood'], 
                            toronto_data['Latitude'], 
                            toronto_data['Longitude'])

In [103]:
print(venues_df.shape)
venues_df.head()

(1600, 9)


Unnamed: 0,PostalCode,Borough,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


<b>Let us gain some quick stats from our dataset.</b>

<b><font color='blue'>1. Venues per Postal Code</font></b>

In [104]:
venues_df.groupby(["PostalCode", "Borough", "Neighborhood"]).count().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
PostalCode,Borough,Neighborhood,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M4E,East Toronto,The Beaches,5,5,5,5,5,5
M4K,East Toronto,"The Danforth West, Riverdale",43,43,43,43,43,43
M4L,East Toronto,"India Bazaar, The Beaches West",17,17,17,17,17,17
M4M,East Toronto,Studio District,37,37,37,37,37,37
M4N,Central Toronto,Lawrence Park,3,3,3,3,3,3


<b><font color='blue'>2. Venues per Neighborhood</font></b>

In [105]:
venues_df.groupby("Neighborhood").count().head()

Unnamed: 0_level_0,PostalCode,Borough,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Berczy Park,57,57,57,57,57,57,57,57
"Brockton, Parkdale Village, Exhibition Place",24,24,24,24,24,24,24,24
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16,16,16
Central Bay Street,63,63,63,63,63,63,63,63


<b><font color='blue'>3. Unique Venue Categories</font></b>

In [106]:
print('Count of unique venue categories in the dataset: {}'.format(len(venues_df['Venue_Category'].unique())))
print('Few sample values of venue categories: ')
venues_df['Venue_Category'].unique()[:10].tolist()

Count of unique venue categories in the dataset: 235
Few sample values of venue categories: 


['Coffee Shop',
 'Bakery',
 'Distribution Center',
 'Restaurant',
 'Spa',
 'Park',
 'Pub',
 'Breakfast Spot',
 'Gym / Fitness Center',
 'Historic Site']

### <font color='green'>Analyzing Venues in the Neighborhoods</font>

In [107]:
# One hot encoding to add one column per category
toronto_venues_df = pd.get_dummies(venues_df[['Venue_Category']], prefix="", prefix_sep="")

# Add the columns 'PostalCode', 'Borough', Neighborhood' to dataframe
toronto_venues_df['PostalCode'] = venues_df['PostalCode'] 
toronto_venues_df['Borough'] = venues_df['Borough'] 
toronto_venues_df['Neighborhood'] = venues_df['Neighborhood']

# Move the above three columns to the first three positions
new_columns = list(toronto_venues_df.columns[-2:]) + list(toronto_venues_df.columns[:-3])
toronto_venues_df = toronto_venues_df[new_columns]

toronto_venues_df.head()

Unnamed: 0,PostalCode,Borough,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,M5A,Downtown Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,Downtown Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,Downtown Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,Downtown Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,Downtown Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<b>Let us now look at the frequency of the venues in each neighborhood.</b>

In [108]:
venue_freq_df = toronto_venues_df.groupby('Neighborhood').mean().reset_index()
venue_freq_df.head()

Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theater,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.015873


<b>Below, we find the top 10 common venues (based on category) for each neighborhood.</b>

In [109]:
# Function to create a list of column names as per the number of top venues 
def get_freq_col_list(venues_count):
    freq_indicator_suffixes = ['st', 'nd', 'rd', 'th']
    freq_cols = []
    for i in np.arange(1, venues_count+1):
        if i in [1, 2, 3]:
            freq_cols.append('{}{} Most Common Venue'.format(i, freq_indicator_suffixes[i-1]))
        elif i >= 4 and i <=20:
            freq_cols.append('{}{} Most Common Venue'.format(i, freq_indicator_suffixes[3]))
        else:
            rem = i % 10
            if rem in [1, 2, 3]:
                freq_cols.append('{}{} Most Common Venue'.format(i, freq_indicator_suffixes[rem-1]))
            else:
                freq_cols.append('{}{} Most Common Venue'.format(i, freq_indicator_suffixes[3]))
    return freq_cols

In [110]:
# Create a new dataframe with the most common venues
top_venues_count = 10
neighborhood_col = ['Neighborhood']
toronto_venues_sorted_df_columns = neighborhood_col + get_freq_col_list(top_venues_count)
toronto_venues_sorted_df = pd.DataFrame(columns=toronto_venues_sorted_df_columns)

toronto_venues_sorted_df['Neighborhood'] = venue_freq_df['Neighborhood']
for ind in np.arange(venue_freq_df.shape[0]):
    row_categories = venue_freq_df.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    toronto_venues_sorted_df.iloc[ind, 1:] = row_categories_sorted.index.values[0:top_venues_count]

toronto_venues_sorted_df.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Restaurant,Beer Bar,Farmers Market,Pharmacy,Cheese Shop,Belgian Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Nightclub,Bakery,Coffee Shop,Pet Store,Climbing Gym,Stadium,Burrito Place,Restaurant
2,"Business reply mail Processing Centre, South C...",Comic Shop,Garden Center,Farmers Market,Light Rail Station,Fast Food Restaurant,Burrito Place,Restaurant,Recording Studio,Brewery,Auto Workshop
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Bar,Plane,Coffee Shop,Rental Car Location,Sculpture Garden,Boat or Ferry
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Salad Place,Burger Joint,Department Store,Japanese Restaurant,Thai Restaurant,Bubble Tea Shop


### <font color='green'>Clustering the Neighborhoods</font>

<b>We now use k-means clustering algorithm to cluster the neighborhood into 5 different clusters.</b>

In [111]:
# Set number of clusters
kclusters = 5
toronto_clustering = venue_freq_df.drop(['Neighborhood'], 1)

# Use k-means clustering algorithm to create clusters
kmeans_clusters = KMeans(n_clusters=kclusters, random_state=1).fit(toronto_clustering)

<b>Next, we create a new dataframe that contains the cluster as well the top 10 venues for each neighborhood.</b>

In [112]:
toronto_venues_sorted_df.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged_df = toronto_data.copy()
toronto_merged_df = toronto_merged_df.merge(toronto_venues_sorted_df.set_index('Neighborhood'), on='Neighborhood')

toronto_merged_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Restaurant,Theater,Performing Arts Venue,Dessert Shop
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Diner,Sushi Restaurant,Italian Restaurant,Bar,Bank,Mexican Restaurant,Beer Bar,Fried Chicken Joint,Portuguese Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Italian Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Pizza Place,Electronics Store,Diner
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Cocktail Bar,Cosmetics Shop,Gastropub,Creperie,Lingerie Store,Beer Bar,Farmers Market,Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Park,Health Food Store,Trail,Pub,Wine Bar,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop


<b> Let us now visualize the clusters.</b>

In [114]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged_df['Latitude'], toronto_merged_df['Longitude'], toronto_merged_df['Neighborhood'], kmeans_clusters.labels_):
    label = folium.Popup(str(poi) + " - Cluster " + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon], 
        radius=5, 
        popup=label, 
        color=rainbow[cluster-1], 
        fill=True, 
        fill_color=rainbow[cluster-1], 
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

<b>Each cluster can be examined and the discriminating venue categories that distinguish the clusters, can be determined, like how its done for the first cluster, <font color='blue'>Cluster 0</font> below.</b>

In [115]:
toronto_merged_df.loc[toronto_merged_df['Cluster Labels'] == 0, toronto_merged_df.columns[[1] + list(range(5, toronto_merged_df.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot,Café,Restaurant,Theater,Performing Arts Venue,Dessert Shop
1,Downtown Toronto,0,Coffee Shop,Diner,Sushi Restaurant,Italian Restaurant,Bar,Bank,Mexican Restaurant,Beer Bar,Fried Chicken Joint,Portuguese Restaurant
2,Downtown Toronto,0,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Italian Restaurant,Middle Eastern Restaurant,Bubble Tea Shop,Pizza Place,Electronics Store,Diner
3,Downtown Toronto,0,Café,Coffee Shop,Cocktail Bar,Cosmetics Shop,Gastropub,Creperie,Lingerie Store,Beer Bar,Farmers Market,Restaurant
4,East Toronto,0,Park,Health Food Store,Trail,Pub,Wine Bar,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
5,Downtown Toronto,0,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Restaurant,Beer Bar,Farmers Market,Pharmacy,Cheese Shop,Belgian Restaurant
6,Downtown Toronto,0,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Salad Place,Burger Joint,Department Store,Japanese Restaurant,Thai Restaurant,Bubble Tea Shop
7,Downtown Toronto,0,Grocery Store,Café,Coffee Shop,Park,Candy Store,Nightclub,Bank,Baby Store,Restaurant,Athletics & Sports
8,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Deli / Bodega,Clothing Store,Thai Restaurant,Hotel,Gym,Salad Place,Sushi Restaurant
9,West Toronto,0,Pharmacy,Bakery,Music Venue,Middle Eastern Restaurant,Brazilian Restaurant,Bar,Café,Bank,Supermarket,Grocery Store
