# Coursera Applied Data Science Capstone - Week 3 - Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Building Neighborhood Dataframe

Importing basic libraries

In [1]:
import numpy as np
import pandas as pd
import json

#### Collecting Data

First, data about Toronto neighborhoods needs to be collected from the Wikipedia webpage: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

The package Beautiful Soup is used to scrape the necessary information.

In [2]:
#Importing BeautifulSoup for parsing and urlopen for fetching the page
from bs4 import BeautifulSoup

from urllib.request import urlopen


In [3]:
#Wikipedia page with information
url ="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#Creating the soup
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')

#### Looking for data of interest

The data we are interested in is included in a table on the webpage.

After looking through the website code, it is found that the table information is stored in the <table> tag.

This information is extracted.

In [4]:
soup = soup.find_all('table') #Collecting every <table> element; the row limit in the loop below keeps only the main table
#soup 

#### Adding the data into a dataframe

Defining the dataframe

In [5]:
#Column names
column_names = ['PostalCode', 'Borough', 'Neighborhood']

#Creating the dataframe
toronto_data = pd.DataFrame(columns = column_names)
toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood


In the webpage, the table data is contained in \<tr> for each row and \<td> for each cell. This information is read into the dataframe one row at a time.

In [6]:
#adding data into the dataframe

postal_code = ''
borough = ''
neighborhood = ''
count = 0

for table in soup:
        row = table.find_all('tr')
        
        for each_row in row:
            row_cells = each_row.find_all('td')
            #print (row_cells)
            
            #Skipping any rows that don't contain enough data cells
            if len(row_cells) > 1:
                if count < 180:  # To keep the collection limited to the main table
                    count = count + 1
                    #print (count)

                    postal_code = row_cells[0].text.strip()
                    borough = row_cells[1].text.strip()
                    neighborhood = row_cells[2].text.strip()
                    
                    new_row = {'PostalCode' : postal_code, 'Borough' : borough, 'Neighborhood' : neighborhood}
                
                    #print (postal_code)
                    #print (borough)
                    #print (neighborhood)
                    #print(new_row)
            
                    toronto_data = toronto_data.append(new_row, ignore_index=True)
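Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older pandas versions. A minimal sketch of one alternative is shown below: collect each row as a dictionary in a list and build the dataframe once at the end. The `scraped_cells` list is a hypothetical stand-in for the text of the \<td> cells, not data fetched from the webpage.

```python
import pandas as pd

# Hypothetical stand-in for the raw text of the <td> cells in three scraped rows
scraped_cells = [
    ("M1A\n", "Not assigned\n", "Not assigned\n"),
    ("M3A\n", "North York\n", "Parkwoods\n"),
    ("M4A\n", "North York\n", "Victoria Village\n"),
]

# Collect one dictionary per row instead of appending to the dataframe
rows = []
for postal_code, borough, neighborhood in scraped_cells:
    rows.append({
        'PostalCode': postal_code.strip(),
        'Borough': borough.strip(),
        'Neighborhood': neighborhood.strip(),
    })

# Build the dataframe in a single step
toronto_demo = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(toronto_demo)
```

Building the dataframe once also avoids copying the whole table on every iteration.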

In [7]:
#Checking the table
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Cleaning up the table by:
  - Dropping rows where the Borough is Not assigned.
  - If a row has a Borough but a Not assigned Neighborhood, then the Neighborhood will be set the same as the Borough.

In [8]:
#Dropping unnecessary rows (copying so the assignment below modifies an independent dataframe)
toronto_data_table = toronto_data[toronto_data.Borough != 'Not assigned'].copy()

#Replacing Neighborhood = 'Not assigned' with Neighborhood = Borough for necessary rows
not_assigned = toronto_data_table['Neighborhood'] == 'Not assigned'
toronto_data_table.loc[not_assigned, 'Neighborhood'] = toronto_data_table.loc[not_assigned, 'Borough']
    
#Resetting the index after dropping rows
toronto_data_table = toronto_data_table.reset_index(drop = True)
toronto_data_table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [9]:
#Looking at the number of rows in the dataframe
toronto_data_table.shape

(103, 3)

## Part 2: Adding latitude and longitude

In [10]:
#Adding new columns to the dataframe
plus_new_columns = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']
toronto_table = toronto_data_table.reindex(columns = plus_new_columns)

#Replacing the NaN values in the Latitude and Longitude columns 
toronto_table['Latitude'] = toronto_table['Latitude'].fillna(0)
toronto_table['Longitude'] = toronto_table['Longitude'].fillna(0)

toronto_table.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,0.0,0.0
1,M4A,North York,Victoria Village,0.0,0.0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",0.0,0.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0.0,0.0


Read the latitude and longitude values from a given csv file.

In [11]:
url = 'http://cocl.us/Geospatial_data'
lat_long = pd.read_csv(url)
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Iterate over the Latitude and Longitude data and toronto data, to match postal codes and transfer latitude and longitude values to the Toronto dataframe.

In [12]:
#Iterate over each row in Lat_Long table
for index1, ll_row in lat_long.iterrows():
    ll_code = ll_row['Postal Code']
    #Iterate over each row in Toronto Table
    for index2, t_row in toronto_table.iterrows():
        t_code = t_row['PostalCode']
        
        #If the postal codes match, the Latitude and Longitude values are transferred to the matching row in the Toronto data table.
        if t_code == ll_code :            
            toronto_table.at[index2,'Latitude'] = ll_row['Latitude']
            toronto_table.at[index2,'Longitude'] = ll_row['Longitude']
        else:
            pass

            
toronto_table.head(100)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
95,M1X,Scarborough,Upper Rouge,43.836125,-79.205636
96,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
97,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.382280
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
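The nested loops above compare every row of one table with every row of the other. As a hedged alternative, the same matching can be sketched as a left merge on the postal code. The two small dataframes below are illustrative samples built from values shown in this notebook; the names `toronto_demo`, `lat_long_demo`, and `merged_demo` are assumptions made for this example.

```python
import pandas as pd

# Small sample of the neighborhood table (without coordinates)
toronto_demo = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Small sample of the coordinate table, in a different row order
lat_long_demo = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A', 'M1B'],
    'Latitude': [43.725882, 43.753259, 43.806686],
    'Longitude': [-79.315572, -79.329656, -79.194353],
})

# Align the key column name, then keep every neighborhood row (left merge)
merged_demo = toronto_demo.merge(
    lat_long_demo.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
)
print(merged_demo)
```

With a left merge, a postal code missing from the coordinate table would show NaN rather than the 0.0 placeholder used above.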


## Part 3: Exploring and Clustering the neighborhoods in Toronto

Now that all the neighborhood and position data is collected, we can explore the neighborhoods, cluster them with the k-means method, and analyze the resulting clusters.

Importing libraries.

In [13]:
import json # library to handle JSON files

# convert an address into latitude and longitude values
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

# library to handle requests
import requests
# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
!pip install folium==0.5.0
import folium

Collecting folium==0.5.0
  Downloading folium-0.5.0.tar.gz (79 kB)
Collecting branca
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.5.0-py3-none-any.whl size=76240 sha256=c9dbfcfef1ac29c4acb1cd2de9811fe57f6879781f26690bbd3d2824be6ee9ba
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/b2/2f/2c/109e446b990d663ea5ce9b078b5e7c1a9c45cca91f377080f8
Successfully built folium
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.5.0


Toronto's Latitude and Longitude

In [14]:
c_address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="canada_explorer")
location = geolocator.geocode(c_address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Looking at all the neighborhoods

In [15]:
# create a map
map_canada = folium.Map(location=[latitude, longitude], zoom_start=10)

# adding neighborhood markers to the map
for lat, lng, borough, neighborhood in zip(toronto_table['Latitude'], toronto_table['Longitude'], toronto_table['Borough'], toronto_table['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)  
    
map_canada

For this exploration, ONLY boroughs that contain the word Toronto are used, so the data is filtered and reformatted.

In [16]:
#Filtering only the rows that contain Toronto in the Borough
only_toronto = toronto_table[toronto_table['Borough'].str.contains('Toronto')].reset_index(drop=True)

#Creating a new dataframe
all_toronto = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'])

In [17]:
#If more than one neighborhood is listed in a row, splitting them into separate rows
for index, row in only_toronto.iterrows():
    check = str(row['Neighborhood'])
    if ", " in check:
        split = check.split(', ')
        number = len(split)
        
        for i in range(number):
            one_neighborhood = split[i]

            
            new_row = {'PostalCode' : row['PostalCode'], 'Borough' : row['Borough'], 'Neighborhood' : one_neighborhood, 
                       'Latitude': row['Latitude'], 'Longitude' : row['Longitude']}
            all_toronto = all_toronto.append(new_row, ignore_index=True)
    else:
        all_toronto = all_toronto.append(row, ignore_index=True)

all_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636
1,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
2,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
3,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494
4,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
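The row-splitting step can also be sketched without an explicit loop, using `str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). The sample rows and the names `only_demo` and `split_demo` are assumptions for illustration. Stripping whitespace after the split also prevents names that differ only by a leading space from being treated as separate neighborhoods.

```python
import pandas as pd

# Illustrative sample: one row lists two neighborhoods, one row lists a single neighborhood
only_demo = pd.DataFrame({
    'PostalCode': ['M5A', 'M4E'],
    'Borough': ['Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Regent Park, Harbourfront', 'The Beaches'],
})

# Turn each comma-separated string into a list, then give every list item its own row
split_demo = (
    only_demo
    .assign(Neighborhood=only_demo['Neighborhood'].str.split(','))
    .explode('Neighborhood')
    .reset_index(drop=True)
)

# Remove stray whitespace left over from the split
split_demo['Neighborhood'] = split_demo['Neighborhood'].str.strip()
print(split_demo)
```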


Visualizing the neighborhoods

In [18]:
# create a map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# adding neighborhood markers to the map
for lat, lng, borough, neighborhood in zip(all_toronto['Latitude'], all_toronto['Longitude'], all_toronto['Borough'], all_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now the Foursquare API can be used to explore the neighborhoods in Toronto.

In [19]:
# The code was removed by Watson Studio for sharing.
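The removed cell presumably defined the names `CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, and `LIMIT` that the function below relies on. The following is only a sketch of what such a cell might look like: the environment variable names and fallback strings are hypothetical, and the `VERSION` and `LIMIT` values are assumptions, not the values used in the original run.

```python
import os

# Hypothetical placeholders: the real credentials were removed before sharing.
# Reading them from environment variables keeps secrets out of the notebook.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', 'your-client-id')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', 'your-client-secret')

# Assumed values: an API version date in YYYYMMDD form and a per-request venue limit
VERSION = '20180605'
LIMIT = 100
```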

Defining a function to get information from Foursquare about all the neighborhoods.

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
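The function above indexes directly into the response and assumes that every venue has at least one category. As a hedged sketch, the parsing step could be separated into a helper that tolerates missing groups or categories. The helper name `extract_venue_rows` and the `sample_payload` dictionary are hypothetical; the payload only mimics the keys that the function above reads.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one already-parsed explore response into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hypothetical payload with the same shape as the keys used above
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Tandem Coffee',
                   'location': {'lat': 43.653559, 'lng': -79.361809},
                   'categories': [{'name': 'Coffee Shop'}]}},
        {'venue': {'name': 'Unnamed Spot',
                   'location': {'lat': 43.65, 'lng': -79.36},
                   'categories': []}},
    ]}]}
}

venue_rows = extract_venue_rows('Regent Park', 43.65426, -79.360636, sample_payload)
print(venue_rows)
```

An empty or malformed response then yields an empty list for that neighborhood instead of stopping the whole collection.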

Running the function on the all_toronto table

In [22]:
toronto_venues = getNearbyVenues(names = all_toronto['Neighborhood'], latitudes = all_toronto['Latitude'], 
                                   longitudes = all_toronto['Longitude'])

Regent Park
Harbourfront
Queen's Park
Ontario Provincial Government
Garden District
Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond
Adelaide
King
Dufferin
Dovercourt Village
Harbourfront East
Union Station
Toronto Islands
Little Portugal
Trinity
The Danforth West
Riverdale
Toronto Dominion Centre
Design Exchange
Brockton
Parkdale Village
Exhibition Place
India Bazaar
The Beaches West
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
Forest Hill Road Park
High Park
The Junction South
North Toronto West
 Lawrence Park
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
University of Toronto
Harbord
Runnymede
Swansea
Moore Park
Summerhill East
Kensington Market
Chinatown
Grange Park
Summerhill West
Rathnelly
South Hill
Forest Hill SE
Deer Park
CN Tower
King and Spadina
Railway Lands
Harbourfront West
Bathurst Quay
South Niagara
Island airport
Rosedale
Stn A PO Boxes
St. James To

In [24]:
print(toronto_venues.shape)
toronto_venues.head()

(3199, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Regent Park,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Regent Park,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,Regent Park,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


In [25]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Lawrence Park,18,18,18,18,18,18
Adelaide,100,100,100,100,100,100
Bathurst Quay,16,16,16,16,16,16
Berczy Park,55,55,55,55,55,55
Brockton,23,23,23,23,23,23
...,...,...,...,...,...,...
Underground city,100,100,100,100,100,100
Union Station,100,100,100,100,100,100
University of Toronto,34,34,34,34,34,34
Victoria Hotel,100,100,100,100,100,100


In [26]:
#How many unique categories
print('There are {} unique Neighborhoods.'.format(len(toronto_venues['Neighborhood'].unique())))
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 77 unique Neighborhoods.
There are 236 unique categories.


We can use One Hot Encoding to make the data easier to work with.  

In [27]:
# one hot encoding
toronto_venues_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neighborhoods = toronto_venues['Neighborhood'] 
toronto_venues_onehot.insert (0, "The Neighborhoods", neighborhoods)

toronto_venues_onehot.head()

Unnamed: 0,The Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The data can now be reshaped to support the clustering analysis:
- First, group the data by neighborhood and compute the mean frequency of each venue category.
- Then, create a new dataframe that lists the top 10 venues for each neighborhood.

In [28]:
toronto_grouped = toronto_venues_onehot.groupby('The Neighborhoods').mean().reset_index()
toronto_grouped

Unnamed: 0,The Neighborhoods,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Lawrence Park,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.00,0.0,...,0.0,0.00,0.0,0.00,0.000000,0.000000,0.0,0.00,0.00,0.055556
1,Adelaide,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.02,0.0,...,0.0,0.01,0.0,0.00,0.010000,0.000000,0.0,0.00,0.01,0.000000
2,Bathurst Quay,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.0625,0.00,0.0,...,0.0,0.00,0.0,0.00,0.000000,0.000000,0.0,0.00,0.00,0.000000
3,Berczy Park,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.00,0.0,...,0.0,0.00,0.0,0.00,0.018182,0.000000,0.0,0.00,0.00,0.000000
4,Brockton,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.00,0.0,...,0.0,0.00,0.0,0.00,0.000000,0.000000,0.0,0.00,0.00,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Underground city,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.03,0.0,...,0.0,0.00,0.0,0.01,0.010000,0.000000,0.0,0.01,0.00,0.000000
73,Union Station,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.00,0.0,...,0.0,0.00,0.0,0.01,0.010000,0.000000,0.0,0.01,0.00,0.000000
74,University of Toronto,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.00,0.0,...,0.0,0.00,0.0,0.00,0.000000,0.029412,0.0,0.00,0.00,0.029412
75,Victoria Hotel,0.0,0.0000,0.0000,0.0000,0.000,0.000,0.0000,0.04,0.0,...,0.0,0.00,0.0,0.00,0.020000,0.000000,0.0,0.01,0.00,0.000000
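The one-hot encoding and grouping steps can be sketched on a toy table, where the numbers are easy to check by hand. The data and the names `venues_demo`, `onehot_demo`, and `grouped_demo` are assumptions for illustration. Because each row has a single 1 in its category column, the mean of a column within a neighborhood equals the share of that neighborhood's venues in that category.

```python
import pandas as pd

# Toy venue list: neighborhood A has four venues, neighborhood B has two
venues_demo = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# One column per category, with a 1 marking the category of each venue
onehot_demo = pd.get_dummies(
    venues_demo[['Venue Category']], prefix='', prefix_sep='', dtype=int
)
onehot_demo.insert(0, 'Neighborhood', venues_demo['Neighborhood'])

# Mean of the indicator columns gives the frequency of each category per neighborhood
grouped_demo = onehot_demo.groupby('Neighborhood').mean().reset_index()
print(grouped_demo)
```

Here neighborhood A comes out as 0.5 Coffee Shop, 0.25 Park, and 0.25 Pub, while neighborhood B is 1.0 Park.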


This function sorts the venues in descending order.

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighborhood'] = toronto_grouped['The Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Lawrence Park,Coffee Shop,Clothing Store,Yoga Studio,Bagel Shop,Furniture / Home Store,Ice Cream Shop,Fast Food Restaurant,Diner,Mexican Restaurant,Chinese Restaurant
1,Adelaide,Coffee Shop,Café,Hotel,Gym,Restaurant,Clothing Store,Thai Restaurant,Bar,Bookstore,Burrito Place
2,Bathurst Quay,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
3,Berczy Park,Coffee Shop,Seafood Restaurant,Cocktail Bar,Farmers Market,Beer Bar,Restaurant,Cheese Shop,Bakery,Sandwich Place,Department Store
4,Brockton,Café,Breakfast Spot,Nightclub,Coffee Shop,Climbing Gym,Burrito Place,Restaurant,Italian Restaurant,Intersection,Bar
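The ranking and column-naming logic can be illustrated in isolation. The sketch below sorts a toy frequency row and builds ordinal column labels with a small helper instead of a try/except. The helper `ordinal` and the sample values are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Toy frequency row for one neighborhood
freq = pd.Series({'Pub': 0.15, 'Coffee Shop': 0.5, 'Gym': 0.10, 'Park': 0.25})

# Sort descending and keep the names of the three most frequent categories
top3 = freq.sort_values(ascending=False).index[:3].tolist()

# Column labels in the same style as the table above
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(3)]
print(dict(zip(labels, top3)))
```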


Now k-means can be used to group the neighborhoods into clusters.
Here the number of clusters is set to 5.

In [31]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('The Neighborhoods', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 0, 0, 2, 0, 0, 0], dtype=int32)
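To make the clustering step easier to follow, here is a minimal sketch of k-means on a toy frequency matrix with two obvious groups. The matrix `X` and the choice of two clusters are assumptions for illustration only; `n_init` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency rows: the first two are coffee-heavy, the last two are park-heavy
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Fit k-means with a fixed seed so the run is repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.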

With the cluster information, a dataframe is created to include the cluster and venue information.

In [34]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = all_toronto

# merge venues_sorted with the Toronto data to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636,0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Electronics Store,Spa,Beer Store
1,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Electronics Store,Spa,Beer Store
2,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,0,Coffee Shop,Yoga Studio,Diner,Restaurant,Portuguese Restaurant,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop
3,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494,0,Coffee Shop,Yoga Studio,Diner,Restaurant,Portuguese Restaurant,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop
4,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,0,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Bookstore,Pizza Place,Middle Eastern Restaurant


Time to visualize the clusters.

In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The Clusters can be examined to determine the distinguishing features of each cluster.
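Before reading the per-cluster tables, a quick summary can help. The sketch below counts the neighborhoods in each cluster and finds the most frequent top venue per cluster. The small `merged_demo` dataframe is a made-up stand-in with the same column names as the merged table, not the actual results.

```python
import pandas as pd

# Made-up stand-in for the merged table
merged_demo = pd.DataFrame({
    'Neighborhood': ['N1', 'N2', 'N3', 'N4'],
    'Cluster Labels': [0, 0, 1, 0],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub'],
})

# Number of neighborhoods in each cluster
cluster_sizes = merged_demo['Cluster Labels'].value_counts().sort_index()

# Most frequent first-ranked venue within each cluster
top_first_venue = (
    merged_demo
    .groupby('Cluster Labels')['1st Most Common Venue']
    .agg(lambda s: s.value_counts().idxmax())
)

print(cluster_sizes)
print(top_first_venue)
```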

#### Cluster 1:
This cluster contains the majority of the neighborhoods. Based on the visible rows, the top venues across the neighborhoods are Coffee Shop, Park, Pub, Gym, and a variety of restaurants.

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Regent Park,0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Electronics Store,Spa,Beer Store
1,Harbourfront,0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Electronics Store,Spa,Beer Store
2,Queen's Park,0,Coffee Shop,Yoga Studio,Diner,Restaurant,Portuguese Restaurant,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop
3,Ontario Provincial Government,0,Coffee Shop,Yoga Studio,Diner,Restaurant,Portuguese Restaurant,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop
4,Garden District,0,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Bookstore,Pizza Place,Middle Eastern Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
73,First Canadian Place,0,Coffee Shop,Café,Gym,Hotel,Japanese Restaurant,Restaurant,Deli / Bodega,Salad Place,Seafood Restaurant,Asian Restaurant
74,Underground city,0,Coffee Shop,Café,Gym,Hotel,Japanese Restaurant,Restaurant,Deli / Bodega,Salad Place,Seafood Restaurant,Asian Restaurant
75,Church and Wellesley,0,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Pub,Men's Store,Mediterranean Restaurant,Hotel,Yoga Studio
76,Business reply mail Processing Centre,0,Park,Pizza Place,Light Rail Station,Skate Park,Burrito Place,Farmers Market,Fast Food Restaurant,Butcher,Restaurant,Recording Studio


#### Cluster 2:
In this cluster, the first two neighborhoods have exactly the same venues in the same order. The third neighborhood shares most of these venues, but with some differences in content and rank.

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
36,Forest Hill North & West,1,Park,Sushi Restaurant,Jewelry Store,Trail,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
37,Forest Hill Road Park,1,Park,Sushi Restaurant,Jewelry Store,Trail,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
69,Rosedale,1,Park,Trail,Playground,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


#### Cluster 3: 
This cluster can be categorized as containing the airport and harbor. The top venues include those that travelers or commuters would require.

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,CN Tower,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
63,King and Spadina,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
64,Railway Lands,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
65,Harbourfront West,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
66,Bathurst Quay,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
67,South Niagara,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden
68,Island airport,2,Airport Lounge,Airport Service,Boutique,Harbor / Marina,Boat or Ferry,Rental Car Location,Bar,Plane,Coffee Shop,Sculpture Garden


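#### Cluster 4:
This cluster contains two neighborhoods (Moore Park and Summerhill East) that share the same top venues, led by Playground, Trail, and Yoga Studio.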

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,Moore Park,3,Playground,Trail,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
53,Summerhill East,3,Playground,Trail,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


#### Cluster 5: 
This cluster only contains one neighborhood (Lawrence Park).

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
33,Lawrence Park,4,Park,Swim School,Bus Line,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
