# Segmenting and Clustering Neighborhoods in Toronto Assignment

In this notebook, we are going to explore and cluster the neighborhoods in Toronto based on the data found at the following URL:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

## Assumptions
We assume that all necessary packages are installed and available. In our setup, we use conda to manage the development environment and the Python packages.

## Part A
Part A of the assignment refers to questions 1-4. The result of this part is a dataframe like the one presented in the assignment description. 

In [164]:
# import necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# read webpage content and convert table data to pandas dataframe
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'html.parser')
table = soup.find_all('table')[0] # search for all tables in the page and keep the first one
table # show table data

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Harbourfront</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North

In [2]:
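# convert html table to pandas dataframe
from io import StringIO

# wrap the markup in StringIO (newer pandas versions warn on literal HTML strings)
df = pd.read_html(StringIO(str(table)))[0] # read_html returns a list of dataframes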
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


In [3]:
# we drop rows that do not have an assigned borough
df.drop(df[ df['Borough'] == 'Not assigned' ].index , inplace=True)
df

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West


If a row has a borough but its neighbourhood is 'Not assigned', the neighbourhood takes the name of the borough. We process the dataframe to apply this rule.

In [4]:
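# select the affected rows once and assign through df.loc to avoid chained indexing
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']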
df

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
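
As a small illustration of the rule above, the following sketch applies it to three made-up rows (the postcodes are hypothetical and not taken from the Wikipedia table). It uses `Series.mask`, which replaces the values where a condition holds, as one possible way to express the substitution:

```python
import pandas as pd

# illustrative rows only; the postcodes are hypothetical
sample = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Not assigned'],
})

# where the neighbourhood is 'Not assigned', take the borough name instead
is_missing = sample['Neighbourhood'] == 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].mask(is_missing, sample['Borough'])

print(sample)
```

Rows that already have a neighbourhood are left unchanged.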


More than one neighbourhood can exist in one postal code area. We process the dataframe to combine them into one row, with the neighbourhoods separated by commas.
We also re-index the dataframe.

In [5]:
# join the neighbourhood names of each (Postcode, Borough) pair into one string
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
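
To make the grouping step easier to follow, here is a minimal sketch on three rows that appear earlier in the notebook (M3A and the two M6A rows). It is only an illustration of how `groupby` with `', '.join` collapses several rows into one:

```python
import pandas as pd

rows = pd.DataFrame({
    'Postcode': ['M3A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor'],
})

# one row per (Postcode, Borough); neighbourhood names joined with ', '
combined = (rows.groupby(['Postcode', 'Borough'])['Neighbourhood']
                .agg(', '.join)
                .reset_index())

print(combined)
```

The two M6A rows end up as a single row, while M3A is left as it was.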


Finally, we present the number of rows and columns of the dataframe.

In [6]:
df.shape

(103, 3)

## Part B
Part B refers to the second part of the assignment, in which we enrich the dataframe from Part A with the latitude and longitude of each postal code area, using either the Geocoder package or the CSV file.

### Coordinates extraction using the Geocoder package
In this approach, we use the Geocoder Python package to look up the coordinates of each postal code.

In [None]:
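import geocoder # import geocoder

MAX_ATTEMPTS = 5 # give up on a postal code after this many failed requests
latitudes = []
longitudes = []

# we iterate over all postal codes in our dataframe
for postal_code in df['Postcode']:
    print('Getting coordinates for postal code {}'.format(postal_code))
    # reset local variables
    lat_lng_coords = None
    attempts = 0

    # retry until we get the coordinates or run out of attempts
    while lat_lng_coords is None and attempts < MAX_ATTEMPTS:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1

    # keep None when the lookup failed so that the rows stay aligned
    latitude, longitude = lat_lng_coords if lat_lng_coords else (None, None)
    latitudes.append(latitude)
    longitudes.append(longitude)

    print('Coordinates for postal code {} are {}, {}'.format(postal_code, latitude, longitude))

# assign the collected values once, one value per row
df['Latitude'] = latitudes
df['Longitude'] = longitudes

print('Coordinates extraction finished!')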

Using the Geocoder package and the code above, we were unable to get the coordinates for all neighbourhoods. After a point, it stopped returning valid values. So, in order to get the coordinates, we will use the CSV file available in the assignment description.
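
Because a lookup service can stop returning values, it helps to bound the number of retries instead of looping indefinitely. The sketch below shows the idea with a stand-in lookup function so that it runs without network access; `geocode_with_retry`, `flaky_lookup` and `always_fails` are hypothetical names introduced only for this illustration, and no real geocoder is called:

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # optional pause between attempts
    return None

# stand-in for a real geocoder call: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.806686, -79.194353]

def always_fails(query):
    return None

found = geocode_with_retry(flaky_lookup, 'M1B, Toronto, Ontario')
missing = geocode_with_retry(always_fails, 'M1C, Toronto, Ontario')
print(found, missing)
```

A lookup that never succeeds yields `None`, which the caller can then handle, for example by falling back to the CSV file.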

### Coordinates extraction using the CSV file
As mentioned above, we will use the CSV file to get the coordinates to enrich our dataframe.

In [8]:
# download the data
!wget -q -O 'toronto_coords.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [9]:
# create a second dataframe with coordinates
df_coords = pd.read_csv('toronto_coords.csv')
df_coords

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [10]:
# rename column name
df_coords.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
df_coords

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [11]:
# we merge our two dataframes based on the Postcode
df_final = df.merge(df_coords, on='Postcode')
df_final

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437


We see that the final dataframe is similar to the one presented in the assignment description and includes all the necessary data.
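
`DataFrame.merge` performs an inner join by default, so a postcode missing from the CSV file would be dropped without notice. As a hedged sketch of how this could be checked, the example below merges two small frames with `how='left'` and `indicator=True`; the postcode `M9Z` and its borough are made up to force a mismatch:

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is hypothetical
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# keep every row of the left frame and record where each row came from
checked = left.merge(right, on='Postcode', how='left',
                     validate='one_to_one', indicator=True)

# postcodes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postcode received coordinates.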

## Part C
In this part, we explore and cluster the neighbourhoods in Toronto. We work only with boroughs whose name contains the word Toronto, and then replicate the analysis that was presented in the lab for the New York City data.

In [72]:
# import all necessary libraries

import numpy as np

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# transform JSON file into a pandas dataframe
from pandas import json_normalize

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

Create a new dataframe that includes only the neighbourhoods whose borough name contains the word 'Toronto'.

In [13]:
# We keep only the neighbourhoods in Toronto
df_toronto = df_final[df_final['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


Use geopy library to get the latitude and longitude values of Toronto.

In [16]:
# Get the coordinates of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto with neighbourhoods superimposed on top.

In [17]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In the next steps, we are going to use the Foursquare API in order to explore the neighbourhoods and segment them.

In [18]:
CLIENT_ID = 'your-client-id' # your Foursquare ID
CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret

VERSION = '20191221' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your-client-id
CLIENT_SECRET: your-client-secret


We explore the first of the neighbourhoods of our dataset.

In [19]:
neighborhood_latitude = df_toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_toronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_toronto.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


We get the top 100 venues that are in The Beaches within a radius of 500 meters.

In [20]:
# set the necessary params
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
# get results and convert into JSON
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e21a825542890001bc52520'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc
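
The request URL above is assembled with a long format string. One alternative, sketched here under the assumption that the same endpoint and query parameters are used, is to build the query with `urllib.parse.urlencode`, which also takes care of escaping. `build_explore_url` is a hypothetical helper introduced for this illustration, and the credentials are placeholders:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version,
                      radius=500, limit=100):
    """Assemble the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url('your-client-id', 'your-client-secret',
                        43.676357, -79.293031, '20191221')
print(url)
```

Keeping the parameters in a dictionary makes it harder to mix up their order than with positional `format` arguments.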

We use the function get_category_type from the lab to extract the category of each venue.

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Then, we process the information and load it into a dataframe.

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


Finally, we present the number of venues returned by Foursquare.

In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


### Explore neighbourhoods in Downtown Toronto

Now, we will do the same for all neighbourhoods in Downtown Toronto.

First, we will use the same function presented in the lab.

In [55]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
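
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue without a category. The sketch below shows one possible guard on a hand-written sample that mimics the structure of the response shown earlier; `venue_rows` is a hypothetical helper and the second venue is made up:

```python
def venue_rows(items):
    """Flatten explore-style items into (name, lat, lng, category) tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

sample_items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Uncategorised Place',  # hypothetical venue
               'location': {'lat': 43.0, 'lng': -79.0},
               'categories': []}},
]

rows = venue_rows(sample_items)
print(rows)
```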

Next, we get all the distinct neighbourhoods in Downtown Toronto.

In [56]:
df_downtown_toronto = df_final[df_final['Borough'] == 'Downtown Toronto'].reset_index(drop=True)

# split each comma-separated name into one row per neighbourhood;
# every neighbourhood keeps the coordinates of its postal code
rows = []
for name, lat, lng in zip(df_downtown_toronto['Neighbourhood'], df_downtown_toronto['Latitude'], df_downtown_toronto['Longitude']):
    for nb_name in name.split(','):
        rows.append({'Neighbourhood': nb_name.strip(), 'Latitude': lat, 'Longitude': lng})

dt_toronto_neighbourhoods = pd.DataFrame(rows, columns=['Neighbourhood', 'Latitude', 'Longitude'])

We get all the venues in Downtown Toronto.

In [58]:
dt_toronto_venues = getNearbyVenues(names=dt_toronto_neighbourhoods['Neighbourhood'],
                                    latitudes=dt_toronto_neighbourhoods['Latitude'],
                                    longitudes=dt_toronto_neighbourhoods['Longitude'])

print(dt_toronto_venues.shape)
dt_toronto_venues.head()

(2438, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,Cabbagetown,43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


Next, we check how many venues were returned by neighbourhood.

In [59]:
dt_toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,100,100,100,100,100,100
Bathurst Quay,12,12,12,12,12,12
Berczy Park,56,56,56,56,56,56
CN Tower,12,12,12,12,12,12
Cabbagetown,44,44,44,44,44,44
Central Bay Street,83,83,83,83,83,83
Chinatown,91,91,91,91,91,91
Christie,17,17,17,17,17,17
Church and Wellesley,81,81,81,81,81,81
Commerce Court,100,100,100,100,100,100


The unique categories of those venues are presented below:

In [62]:
print('There are {} unique categories.'.format(len(dt_toronto_venues['Venue Category'].unique())))

There are 205 unique categories.


### Analyze each neighbourhood

In [63]:
# one hot encoding
dt_toronto_onehot = pd.get_dummies(dt_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
dt_toronto_onehot['Neighbourhood'] = dt_toronto_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [dt_toronto_onehot.columns[-1]] + list(dt_toronto_onehot.columns[:-1])
dt_toronto_onehot = dt_toronto_onehot[fixed_columns]

dt_toronto_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Rosedale,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,Cabbagetown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
dt_toronto_onehot.shape

(2438, 206)

Next, we group the rows by neighbourhood and take the mean of the frequency of occurrence of each category.

In [67]:
dt_toronto_grouped = dt_toronto_onehot.groupby('Neighbourhood').mean().reset_index()
dt_toronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
1,Bathurst Quay,0.0,0.083333,0.083333,0.166667,0.166667,0.166667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
3,CN Tower,0.0,0.083333,0.083333,0.166667,0.166667,0.166667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Cabbagetown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.0,...,0.0,0.0,0.0,0.012048,0.0,0.0,0.012048,0.0,0.0,0.012048
6,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.054945,0.0,0.054945,0.010989,0.0,0.0,0.0
7,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Church and Wellesley,0.012346,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,0.012346,0.0,0.012346
9,Commerce Court,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0


The new size after the grouping is presented below:

In [74]:
dt_toronto_grouped.shape

(36, 206)
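
To see why the mean of one-hot columns gives a frequency, consider the toy example below. The neighbourhood names `A` and `B` and their venues are made up for illustration:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Trail', 'Pub', 'Pub', 'Pub'],
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Each row of `freq` sums to 1, since every venue falls into exactly one category.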

We print each neighbourhood along with its 5 most common venue categories.

In [70]:
num_top_venues = 5

for hood in dt_toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dt_toronto_grouped[dt_toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide----
              venue  freq
0       Coffee Shop  0.07
1              Café  0.04
2        Steakhouse  0.04
3               Bar  0.04
4  Asian Restaurant  0.03


----Bathurst Quay----
                venue  freq
0      Airport Lounge  0.17
1     Airport Service  0.17
2    Airport Terminal  0.17
3            Boutique  0.08
4  Airport Food Court  0.08


----Berczy Park----
          venue  freq
0   Coffee Shop  0.07
1  Cocktail Bar  0.05
2      Beer Bar  0.04
3          Café  0.04
4   Cheese Shop  0.04


----CN Tower----
                venue  freq
0      Airport Lounge  0.17
1     Airport Service  0.17
2    Airport Terminal  0.17
3            Boutique  0.08
4  Airport Food Court  0.08


----Cabbagetown----
         venue  freq
0   Restaurant  0.07
1  Coffee Shop  0.07
2         Park  0.05
3         Café  0.05
4       Bakery  0.05


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.16
1                 Café  0.05
2   Italian Restaurant  0.05
3

We use the function from the lab to sort the venues in descending order.

In [75]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

We create the new dataframe and display the top 10 venues for each neighbourhood.

In [117]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] =  dt_toronto_grouped['Neighbourhood']

for ind in np.arange(dt_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Bar,Café,Steakhouse,Asian Restaurant,Hotel,Clothing Store,Burger Joint,Thai Restaurant,Sushi Restaurant
1,Bathurst Quay,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
2,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Café,Steakhouse,Farmers Market,Cheese Shop,Seafood Restaurant,Beer Bar,Bistro
3,CN Tower,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
4,Cabbagetown,Restaurant,Coffee Shop,Pub,Pizza Place,Park,Bakery,Italian Restaurant,Café,Diner,Pet Store


### Cluster neighbourhoods

Run k-means to cluster the neighbourhoods into 4 clusters.

In [126]:
# set number of clusters
kclusters = 4

dt_toronto_grouped_clustering = dt_toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 0, 1, 1, 2, 2, 1, 1], dtype=int32)
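
The number of clusters is fixed at 4 in this notebook. One common way to sanity-check such a choice is the silhouette score. The sketch below demonstrates the idea on synthetic two-dimensional points (not the Foursquare data), so the blob centres, the spread and the range of `k` are all assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three well-separated synthetic blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# fit k-means for several values of k and score each partition
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real venue frequencies the scores are usually much less clear-cut than on synthetic blobs, so the result is a hint rather than a definitive answer.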

We create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [127]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_toronto_merged = dt_toronto_neighbourhoods # init the dataframe with the lat/long of Downtown Toronto neighbourhoods

dt_toronto_merged = dt_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

dt_toronto_merged.head() 


Unnamed: 0,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,43.679563,-79.377529,3,Park,Playground,Trail,Dessert Shop,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
1,Cabbagetown,43.667967,-79.367675,1,Restaurant,Coffee Shop,Pub,Pizza Place,Park,Bakery,Italian Restaurant,Café,Diner,Pet Store
2,St. James Town,43.667967,-79.367675,1,Coffee Shop,Café,Restaurant,Bakery,Diner,Breakfast Spot,Park,Italian Restaurant,Gastropub,Cosmetics Shop
3,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Mediterranean Restaurant,Gym,Hotel,Café,Pub
4,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Café,Restaurant,Mexican Restaurant,Brewery,Event Space


We create a map to visualize the resulting clusters.

In [128]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_toronto_merged['Latitude'], dt_toronto_merged['Longitude'], dt_toronto_merged['Neighbourhood'], dt_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine clusters

We examine the clusters to see what discriminates them.

In [160]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 0,
                      dt_toronto_merged.columns[[0] + list(range(4, dt_toronto_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,CN Tower,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
26,Bathurst Quay,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
27,Island airport,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
28,Harbourfront West,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
29,King and Spadina,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
30,Railway Lands,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market
31,South Niagara,Airport Terminal,Airport Lounge,Airport Service,Boutique,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Farmers Market


In [161]:
# neighbourhoods in cluster 1
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 1, 
                      dt_toronto_merged.columns[[0] + list(range(4, dt_toronto_merged.shape[1]))]]

In [162]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 2, 
                      dt_toronto_merged.columns[[0] + list(range(4, dt_toronto_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Harbord,Café,Japanese Restaurant,Italian Restaurant,Bookstore,Sandwich Place,Restaurant,Bakery,Bar,Dessert Shop,Chinese Restaurant
21,University of Toronto,Café,Japanese Restaurant,Italian Restaurant,Bookstore,Sandwich Place,Restaurant,Bakery,Bar,Dessert Shop,Chinese Restaurant
22,Chinatown,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Mexican Restaurant,Bar,Bakery,Comfort Food Restaurant
23,Grange Park,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Mexican Restaurant,Bar,Bakery,Comfort Food Restaurant
24,Kensington Market,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Coffee Shop,Chinese Restaurant,Mexican Restaurant,Bar,Bakery,Comfort Food Restaurant
35,Christie,Grocery Store,Café,Park,Italian Restaurant,Athletics & Sports,Nightclub,Diner,Restaurant,Coffee Shop,Baby Store


In [163]:
dt_toronto_merged.loc[dt_toronto_merged['Cluster Labels'] == 3, 
                      dt_toronto_merged.columns[[0] + list(range(4, dt_toronto_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,Park,Playground,Trail,Dessert Shop,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


### Final comments

From the information we get for each cluster, we see that the most common venues are what distinguishes them. For example, the neighbourhoods near the airport match completely on all top 10 venues.
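
One possible explanation for such complete matches, offered here as an interpretation rather than something the notebook verifies, is that neighbourhoods belonging to the same postal code were given the same coordinates, so the venue query is centred on the same point. The sketch below shows how neighbourhoods sharing coordinates could be listed, using four rows from the merged dataframe shown earlier:

```python
import pandas as pd

sample = pd.DataFrame({
    'Neighbourhood': ['Rosedale', 'Cabbagetown', 'St. James Town', 'Church and Wellesley'],
    'Latitude': [43.679563, 43.667967, 43.667967, 43.66586],
    'Longitude': [-79.377529, -79.367675, -79.367675, -79.38316],
})

# collect the neighbourhood names found at each coordinate pair
by_point = sample.groupby(['Latitude', 'Longitude'])['Neighbourhood'].apply(list)

# keep only the points used by more than one neighbourhood
shared = by_point[by_point.map(len) > 1]
print(shared)
```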