# Segmenting and Clustering Neighborhoods in Toronto


### Instructions
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

2 - Use the notebook to build the code that scrapes the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, to obtain the data in the table of postal codes and transform it into a pandas dataframe.

## 1. Download and Explore Dataset

In [1]:
#!conda install -c conda-forge beautifulsoup4 --yes

In [2]:
# import libraries
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

In [3]:
# specify the url
post_codes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
# query the website and return the html to the variable `page`
page = requests.get(post_codes, timeout=5)

In [5]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
code_table = soup.find('table')
code_rows = code_table.find_all('tr')
columns = ['Postcode', 'Borough', 'Neighbourhood']
rows = []
for row in code_rows:
    cells = [cell.text.strip() for cell in row.find_all('td')]
    # skip the header row (it has no <td> cells) and rows without an assigned borough
    if len(cells) >= 3 and cells[1] != "Not assigned":
        # a neighbourhood that is not assigned takes the name of its borough
        if cells[2] == "Not assigned":
            cells[2] = cells[1]
        rows.append(dict(zip(columns, cells[:3])))

# build the dataframe once instead of appending row by row
df_codes = pd.DataFrame(rows, columns=columns)

# combine neighbourhoods that share a postcode into one comma-separated row
df_codes = df_codes.groupby('Postcode', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))

df_codes.shape

(103, 3)
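The three cleaning rules used above (drop unassigned boroughs, let an unassigned neighbourhood take its borough's name, combine neighbourhoods that share a postcode) can also be expressed with vectorized pandas operations. The following is only a sketch on made-up rows; the postcodes and names are hypothetical placeholders, not values from the Wikipedia table.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the scraped table
raw = pd.DataFrame(
    [
        ('X1A', 'Not assigned', 'Not assigned'),
        ('X2A', 'Borough A', 'North Side'),
        ('X2A', 'Borough A', 'South Side'),
        ('X3A', 'Borough B', 'Not assigned'),
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

# 1. drop rows whose borough is not assigned
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. an unassigned neighbourhood takes the name of its borough
mask = clean['Neighbourhood'] == 'Not assigned'
clean.loc[mask, 'Neighbourhood'] = clean.loc[mask, 'Borough']

# 3. one row per postcode, neighbourhoods joined in their original order
clean = (clean.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
              .agg(', '.join))

print(clean)
```

Joining with `', '.join` directly keeps the neighbourhoods in table order, whereas going through `set` makes the order arbitrary.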

In [7]:
!wget -O Geospatial_data.csv https://cocl.us/Geospatial_data

--2018-10-19 05:14:36--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2018-10-19 05:14:36--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2018-10-19 05:14:37--  https://ibm.ent.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.ent.box.com (ibm.ent.box.com)... 107.152.27.211
Connecting to ibm.ent.box.com (ibm.ent.box.com)|107.152.27.211|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [8]:
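df_geo = pd.read_csv("Geospatial_data.csv")
# match rows on the postal code itself rather than on row position
df_codes = df_codes.merge(df_geo, left_on='Postcode', right_on='Postal Code', how='left')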

In [9]:
df_codes

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Morningside, Guildwood, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,M1J,43.744734,-79.239476
6,M1K,Scarborough,"Ionview, Kennedy Park, East Birchmount Park",M1K,43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",M1L,43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Scarborough Village West, Cliffcrest",M1M,43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West, Birch Cliff",M1N,43.692657,-79.264848
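Combining the two tables is only correct when each postcode ends up next to its own coordinates. A positional `DataFrame.join` relies on both tables being sorted identically, while a key-based `merge` does not. The sketch below uses the first two rows shown above, deliberately placed in a different order, to illustrate the difference; the variable names are illustrative.

```python
import pandas as pd

# Two rows from the table above, with the left table in reverse order
codes = pd.DataFrame({'Postcode': ['M1C', 'M1B'],
                      'Borough': ['Scarborough', 'Scarborough']})
geo = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                    'Latitude': [43.806686, 43.784535],
                    'Longitude': [-79.194353, -79.160497]})

# positional join: rows are paired by index, whatever the postcode says
by_position = codes.join(geo)
mismatched = (by_position['Postcode'] != by_position['Postal Code']).sum()

# key-based merge: rows are paired by matching postcode
by_key = codes.merge(geo, left_on='Postcode', right_on='Postal Code', how='left')

print('rows paired with the wrong coordinates by join:', mismatched)
print(by_key)
```

A quick `(df['Postcode'] == df['Postal Code']).all()` after combining is a cheap sanity check either way.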


## Explore and cluster the neighborhoods in the city of Toronto.

In [10]:
!pip install geopy
!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

distributed 1.21.8 requires msgpack, which is not installed.
Solving environment: done

# All requested packages already installed.



Use the geopy library to get the latitude and longitude values of Toronto.

In [11]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer") # placeholder name; recent geopy versions require a custom user_agent
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto are 43.653963, -79.387207.


Let's visualize Toronto neighborhoods.

In [12]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
folium.TileLayer('openstreetmap').add_to(map_toronto)
# add markers to map
for lat, lng, label in zip(df_codes['Latitude'], df_codes['Longitude'], df_codes['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [13]:
CLIENT_ID = '0DXAVIXJUGYNEZNM4E1XVTFKNUMWS1TJ5ZMHGXRP2LZ4O15G' # your Foursquare ID
CLIENT_SECRET = 'M2253MDW32HJB43ZN1P4FKKWAVI5FXLFQYKYYWAWLK0F0IWC' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0DXAVIXJUGYNEZNM4E1XVTFKNUMWS1TJ5ZMHGXRP2LZ4O15G
CLIENT_SECRET:M2253MDW32HJB43ZN1P4FKKWAVI5FXLFQYKYYWAWLK0F0IWC


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [14]:
df_codes.loc[0, 'Neighbourhood']

'Malvern, Rouge'

Get the neighborhood's latitude and longitude values.

In [15]:
neighborhood_latitude = df_codes.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_codes.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_codes.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Malvern, Rouge are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in Malvern, Rouge within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [16]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=0DXAVIXJUGYNEZNM4E1XVTFKNUMWS1TJ5ZMHGXRP2LZ4O15G&client_secret=M2253MDW32HJB43ZN1P4FKKWAVI5FXLFQYKYYWAWLK0F0IWC&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'
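As an alternative to formatting the query string by hand, the parameters can be collected in a dictionary and encoded with the standard library. The sketch below also reads the credentials from environment variables so they do not appear in the notebook; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions made for this example, not part of the Foursquare API.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, radius=500, limit=100, version='20180605'):
    """Build the explore URL for one location (hypothetical helper)."""
    params = {
        # hypothetical environment variable names; placeholders are used if unset
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', 'YOUR_CLIENT_ID'),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', 'YOUR_CLIENT_SECRET'),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the comma in 'll' as %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

url = build_explore_url(43.8, -79.2)
print(url)
```

The same dictionary could instead be passed to `requests.get(EXPLORE_ENDPOINT, params=params)`, which performs the encoding itself.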

Send the GET request and examine the results.

In [17]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5bc968b29fb6b75291be034d'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [19]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056
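To see what the flattening step does without calling the API, the sketch below runs it on a hand-built item that mirrors the Wendy's entry shown above. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`; the item contains only the keys needed for the example.

```python
import pandas as pd

# A minimal item with the same nesting as the Foursquare response above
sample_items = [
    {'venue': {'name': "Wendy's",
               'categories': [{'name': 'Fast Food Restaurant'}],
               'location': {'lat': 43.807448, 'lng': -79.199056}}}
]

# nested dictionaries become dotted column names; lists are left as they are
flat = pd.json_normalize(sample_items)
flattened_columns = set(flat.columns)

# keep the first category name, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# keep only the last part of each dotted name
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```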


## 2. Explore Neighborhoods in Toronto

#### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
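The list comprehension inside `getNearbyVenues` indexes `categories[0]`, which raises an `IndexError` if a venue comes back with an empty category list. A more defensive extraction step could look like the sketch below; `extract_venue_rows` is a hypothetical helper, and the second sample item is made up to show the empty-category case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list for one neighborhood into flat tuples."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        rows.append((
            name,
            lat,
            lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            # None instead of an IndexError when no category is listed
            categories[0]['name'] if categories else None,
        ))
    return rows

sample_items = [
    {'venue': {'name': "Wendy's",
               'categories': [{'name': 'Fast Food Restaurant'}],
               'location': {'lat': 43.807448, 'lng': -79.199056}}},
    # hypothetical venue without any category
    {'venue': {'name': 'Unnamed Spot', 'categories': [],
               'location': {'lat': 43.8, 'lng': -79.2}}},
]

rows = extract_venue_rows('Malvern, Rouge', 43.806686, -79.194353, sample_items)
for row in rows:
    print(row)
```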

#### Now run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [23]:
toronto_venues = getNearbyVenues(names=df_codes['Neighbourhood'],
                                   latitudes=df_codes['Latitude'],
                                   longitudes=df_codes['Longitude']
                                  )

Malvern, Rouge
Highland Creek, Rouge Hill, Port Union
Morningside, Guildwood, West Hill
Woburn
Cedarbrae
Scarborough Village
Ionview, Kennedy Park, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Scarborough Village West, Cliffcrest
Cliffside West, Birch Cliff
Scarborough Town Centre, Wexford Heights, Dorset Park
Wexford, Maryvale
Agincourt
Sullivan, Clarks Corners, Tam O'Shanter
Agincourt North, Steeles East, L'Amoreaux East, Milliken
L'Amoreaux West, Steeles West
Upper Rouge
Hillcrest Village
Fairview, Oriole, Henry Farm
Bayview Village
Silver Hills, York Mills
Willowdale, Newtonbrook
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Don Mills South, Flemingdon Park
Downsview North, Wilson Heights, Bathurst Manor
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
River

#### Let's check the size of the resulting dataframe

In [24]:
print(toronto_venues.shape)
toronto_venues.head()

(2254, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,"Morningside, Guildwood, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Morningside, Guildwood, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


Let's check how many venues were returned for each neighborhood

In [25]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, Richmond, King",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, Steeles East, L'Amoreaux East, Milliken",3,3,3,3,3,3
"Alderwood, Long Branch",10,10,10,10,10,10
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",26,26,26,26,26,26
Berczy Park,55,55,55,55,55,55
Business reply mail Processing Centre969 Eastern,18,18,18,18,18,18
"CFB Toronto, Downsview East",3,3,3,3,3,3
Caledonia-Fairbanks,6,6,6,6,6,6


#### Let's find out how many unique categories can be curated from all the returned venues

In [26]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 276 unique categories.


## 3. Analyze Each Neighborhood

In [27]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [28]:
toronto_onehot.shape

(2254, 276)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [29]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide, Richmond, King",0.000000,0.01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.010000,0.000000,0.000000,0.000000,0.000000,0.010000,0.000000,0.0,0.010000
1,Agincourt,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
2,"Agincourt North, Steeles East, L'Amoreaux East...",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
3,"Alderwood, Long Branch",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
4,Bayview Village,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
5,"Bedford Park, Lawrence Manor East",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
6,Berczy Park,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
7,Business reply mail Processing Centre969 Eastern,0.055556,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
8,"CFB Toronto, Downsview East",0.000000,0.00,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
9,Caledonia-Fairbanks,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.166667


#### Let's confirm the new size

In [30]:
toronto_grouped.shape

(101, 276)
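The one-hot encoding followed by a grouped mean is what turns a list of venues into per-neighborhood category frequencies. A toy example makes the arithmetic visible; the neighborhoods `A` and `B` and their venues are hypothetical.

```python
import pandas as pd

# Hypothetical venues: three in neighborhood A, one in neighborhood B
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Neighborhood `A` has two parks out of three venues, so its `Park` frequency is 2/3, and each row of frequencies sums to 1.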

#### Let's print each neighborhood along with the top 5 most common venues

In [31]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, Richmond, King----
                 venue  freq
0          Coffee Shop  0.07
1                 Café  0.06
2           Steakhouse  0.04
3  American Restaurant  0.04
4      Thai Restaurant  0.04


----Agincourt----
            venue  freq
0          Lounge  0.25
1  Clothing Store  0.25
2    Skating Rink  0.25
3  Breakfast Spot  0.25
4     Yoga Studio  0.00


----Agincourt North, Steeles East, L'Amoreaux East, Milliken----
           venue  freq
0     Playground  0.33
1    Coffee Shop  0.33
2           Park  0.33
3    Yoga Studio  0.00
4  Metro Station  0.00


----Alderwood, Long Branch----
            venue  freq
0     Pizza Place   0.2
1        Pharmacy   0.1
2  Sandwich Place   0.1
3             Gym   0.1
4    Skating Rink   0.1


----Bayview Village----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4          Yoga Studio  0.00


----Bedford Park, Lawrence Manor East----
   

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [32]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
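A quick check of `return_most_common_venues` on a single hand-made row shows what it returns. The row below is hypothetical and uses distinct frequencies, because the order of tied categories depends on the sort.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row in the same layout as toronto_grouped
row = pd.Series({'Neighborhood': 'Toy', 'Bank': 0.1, 'Café': 0.3, 'Park': 0.6})

top = return_most_common_venues(row, 2)
print(list(top))
```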

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [33]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, Richmond, King",Coffee Shop,Café,Thai Restaurant,Steakhouse,American Restaurant,Restaurant,Bar,Gym,Hotel,Cosmetics Shop
1,Agincourt,Lounge,Breakfast Spot,Clothing Store,Skating Rink,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,"Agincourt North, Steeles East, L'Amoreaux East...",Coffee Shop,Park,Playground,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
3,"Alderwood, Long Branch",Pizza Place,Gym,Bank,Pharmacy,Pool,Pub,Sandwich Place,Skating Rink,Coffee Shop,General Entertainment
4,Bayview Village,Chinese Restaurant,Café,Japanese Restaurant,Bank,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
5,"Bedford Park, Lawrence Manor East",Juice Bar,Fast Food Restaurant,Coffee Shop,Italian Restaurant,Comfort Food Restaurant,Thai Restaurant,Liquor Store,Sandwich Place,Restaurant,Butcher
6,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Cheese Shop,Steakhouse,Beer Bar,Farmers Market,Seafood Restaurant,Café,Bakery
7,Business reply mail Processing Centre969 Eastern,Light Rail Station,Yoga Studio,Garden,Smoke Shop,Brewery,Spa,Farmers Market,Fast Food Restaurant,Burrito Place,Restaurant
8,"CFB Toronto, Downsview East",Park,Airport,Bus Stop,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
9,Caledonia-Fairbanks,Park,Women's Store,Fast Food Restaurant,Market,Pharmacy,Airport,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant


## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [34]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 
kmeans.labels_


array([2, 0, 4, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0,
       2, 2, 0, 2, 2, 2, 4, 0, 2, 2, 4, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2,
       2, 2, 0, 4, 2, 2, 3, 2, 2, 2, 2, 2, 0, 2, 2, 3, 2, 2, 4, 2, 4, 0,
       2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 4, 2, 2, 2,
       4, 2, 2, 4, 2, 2, 2, 0, 2, 2, 2, 4, 2], dtype=int32)
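Before interpreting the clusters it helps to know how many neighborhoods fall into each one. The sketch below counts labels with `np.bincount`, using only the first 22 labels from the array above as sample input; on the full result one would pass `kmeans.labels_` instead.

```python
import numpy as np

kclusters = 5

# first row of the label array printed above
labels = np.array([2, 0, 4, 2, 2, 2, 2, 2, 4, 4, 2,
                   2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0])

# minlength keeps a slot for clusters that do not occur in the sample
sizes = np.bincount(labels, minlength=kclusters)

for cluster, size in enumerate(sizes):
    print('cluster label {}: {} neighborhoods'.format(cluster, size))
```

A heavily unbalanced count, with one label holding most of the rows, is a hint that a different number of clusters or different features may be worth trying.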

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [35]:
df_codes

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Morningside, Guildwood, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,M1J,43.744734,-79.239476
6,M1K,Scarborough,"Ionview, Kennedy Park, East Birchmount Park",M1K,43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",M1L,43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Scarborough Village West, Cliffcrest",M1M,43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West, Birch Cliff",M1N,43.692657,-79.264848


In [36]:
df_codes_join = df_codes.rename(columns={'Neighbourhood':'Neighborhood'}) #['Neighbourhood','Latitude','Longitude']
result = toronto_grouped.join(df_codes_join.set_index('Neighborhood'), on='Neighborhood')
#result = pd.concat([toronto_grouped,df_codes_join],axis=1, join='inner', on='Neighborhood')
result

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Postcode,Borough,Postal Code,Latitude,Longitude
0,"Adelaide, Richmond, King",0.000000,0.01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.010000,0.000000,0.0,0.010000,M5H,Downtown Toronto,M5H,43.650571,-79.384568
1,Agincourt,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M1S,Scarborough,M1S,43.794200,-79.262029
2,"Agincourt North, Steeles East, L'Amoreaux East...",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M1V,Scarborough,M1V,43.815252,-79.284577
3,"Alderwood, Long Branch",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M8W,Etobicoke,M8W,43.602414,-79.543484
4,Bayview Village,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M2K,North York,M2K,43.786947,-79.385975
5,"Bedford Park, Lawrence Manor East",0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M5M,North York,M5M,43.733283,-79.419750
6,Berczy Park,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M5E,Downtown Toronto,M5E,43.644771,-79.373306
7,Business reply mail Processing Centre969 Eastern,0.055556,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M7Y,East Toronto,M7Y,43.662744,-79.321558
8,"CFB Toronto, Downsview East",0.000000,0.00,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,M3K,North York,M3K,43.737473,-79.464763
9,Caledonia-Fairbanks,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.166667,M6E,York,M6E,43.689026,-79.453512


In [37]:
toronto_merged = result

# add clustering labels
toronto_merged['Cluster Labels'] = kmeans.labels_

# add the top 10 most common venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!
toronto_merged.columns.values

array(['Neighborhood', 'Yoga Studio', 'Accessories Store',
       'Adult Boutique', 'Afghan Restaurant', 'Airport',
       'Airport Food Court', 'Airport Gate', 'Airport Lounge',
       'Airport Service', 'Airport Terminal', 'American Restaurant',
       'Antique Shop', 'Aquarium', 'Arepa Restaurant', 'Art Gallery',
       'Art Museum', 'Arts & Crafts Store', 'Asian Restaurant',
       'Athletics & Sports', 'Auto Garage', 'Auto Workshop', 'BBQ Joint',
       'Baby Store', 'Bagel Shop', 'Bakery', 'Bank', 'Bar',
       'Baseball Field', 'Baseball Stadium', 'Basketball Court',
       'Basketball Stadium', 'Beach', 'Beer Bar', 'Beer Store',
       'Belgian Restaurant', 'Bike Shop', 'Bistro', 'Board Shop',
       'Boat or Ferry', 'Bookstore', 'Boutique', 'Brazilian Restaurant',
       'Breakfast Spot', 'Brewery', 'Bridal Shop', 'Bubble Tea Shop',
       'Building', 'Burger Joint', 'Burrito Place', 'Bus Line',
       'Bus Station', 'Bus Stop', 'Business Service', 'Butcher',
       'Cafeteria

Finally, let's visualize the resulting clusters

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
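The colour setup only needs one hex colour per cluster. A simpler way to build such a palette, and to look colours up directly by label, might look like the sketch below; `colour_for` is a hypothetical helper, and indexing with the label itself rather than `cluster-1` is a choice made for this example.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# sample the rainbow colormap at kclusters evenly spaced points
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

def colour_for(cluster):
    """Return the hex colour for a cluster label in 0..kclusters-1."""
    return palette[int(cluster)]

for cluster in range(kclusters):
    print(cluster, colour_for(cluster))
```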

## 5. Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(toronto_merged.shape[1]-10, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Agincourt,Lounge,Breakfast Spot,Clothing Store,Skating Rink,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
15,"Cliffside West, Birch Cliff",General Entertainment,College Stadium,Skating Rink,Café,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
16,"Cliffside, Scarborough Village West, Cliffcrest",Motel,Skating Rink,Movie Theater,American Restaurant,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run
21,Don Mills North,Gym / Fitness Center,Café,Japanese Restaurant,Basketball Court,Pool,Caribbean Restaurant,Baseball Field,Women's Store,Donut Shop,Discount Store
24,Downsview Central,Business Service,Korean Restaurant,Baseball Field,Home Service,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
29,"Emery, Humberlea",Construction & Landscaping,Baseball Field,Furniture / Home Store,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
38,"Highland Creek, Rouge Hill, Port Union",Moving Target,Bar,Women's Store,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
39,Hillcrest Village,Golf Course,Dog Run,Pool,Athletics & Sports,Mediterranean Restaurant,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner
41,Humewood-Cedarvale,Field,Trail,Hockey Arena,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
46,"Lawrence Manor, Lawrence Heights",Clothing Store,Furniture / Home Store,Women's Store,Sporting Goods Shop,Event Space,Boutique,Coffee Shop,Accessories Store,Vietnamese Restaurant,Airport Food Court


#### Cluster 2

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(toronto_merged.shape[1]-10, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
69,"Silver Hills, York Mills",Cafeteria,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore,Department Store


#### Cluster 3

In [41]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(toronto_merged.shape[1]-10, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, Richmond, King",Coffee Shop,Café,Thai Restaurant,Steakhouse,American Restaurant,Restaurant,Bar,Gym,Hotel,Cosmetics Shop
3,"Alderwood, Long Branch",Pizza Place,Gym,Bank,Pharmacy,Pool,Pub,Sandwich Place,Skating Rink,Coffee Shop,General Entertainment
4,Bayview Village,Chinese Restaurant,Café,Japanese Restaurant,Bank,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
5,"Bedford Park, Lawrence Manor East",Juice Bar,Fast Food Restaurant,Coffee Shop,Italian Restaurant,Comfort Food Restaurant,Thai Restaurant,Liquor Store,Sandwich Place,Restaurant,Butcher
6,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Cheese Shop,Steakhouse,Beer Bar,Farmers Market,Seafood Restaurant,Café,Bakery
7,Business reply mail Processing Centre969 Eastern,Light Rail Station,Yoga Studio,Garden,Smoke Shop,Brewery,Spa,Farmers Market,Fast Food Restaurant,Burrito Place,Restaurant
10,Canada Post Gateway Processing Centre,Hotel,Coffee Shop,Gym / Fitness Center,American Restaurant,Fried Chicken Joint,Middle Eastern Restaurant,Burrito Place,Sandwich Place,Mediterranean Restaurant,Discount Store
11,Cedarbrae,Hakka Restaurant,Athletics & Sports,Fried Chicken Joint,Thai Restaurant,Bakery,Bank,Caribbean Restaurant,Discount Store,Dog Run,Doner Restaurant
12,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Burger Joint,Sandwich Place,Japanese Restaurant,Bubble Tea Shop,Bar,Falafel Restaurant,Ice Cream Shop
13,Christie,Grocery Store,Café,Park,Convenience Store,Coffee Shop,Restaurant,Italian Restaurant,Diner,Nightclub,Baby Store


#### Cluster 4

In [42]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(toronto_merged.shape[1]-10, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,"Malvern, Rouge",Fast Food Restaurant,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,College Auditorium
59,Parkwoods,Fast Food Restaurant,Food & Drink Shop,Park,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore


#### Cluster 5

In [43]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(toronto_merged.shape[1]-10, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Agincourt North, Steeles East, L'Amoreaux East...",Coffee Shop,Park,Playground,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
8,"CFB Toronto, Downsview East",Park,Airport,Bus Stop,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
9,Caledonia-Fairbanks,Park,Women's Store,Fast Food Restaurant,Market,Pharmacy,Airport,Dessert Shop,Event Space,Ethiopian Restaurant,Empanada Restaurant
28,East Toronto,Convenience Store,Coffee Shop,Park,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
32,"Forest Hill West, Forest Hill North",Trail,Park,Sushi Restaurant,Jewelry Store,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
47,Lawrence Park,Bus Line,Park,Dim Sum Restaurant,Swim School,Doner Restaurant,Dessert Shop,Diner,Discount Store,Dog Run,Women's Store
62,"Richview Gardens, Kingsview Village, St. Phill...",Mobile Phone Shop,Pizza Place,Park,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
64,Rosedale,Park,Trail,Playground,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
84,"The Kingsway, Montgomery Road, Old Mill North",Park,River,Smoke Shop,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
88,"Upwood Park, North Park, Maple Leaf Park",Park,Construction & Landscaping,Basketball Court,Bakery,Women's Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop
