# Coursera Capstone project

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

We will scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to explore neighborhoods in Toronto, Canada.

We will use the Foursquare **Search** endpoint to collect data on nearby **Hospitals, Schools, Shopping Malls, Parks & ATMs**, and then use these features to group the neighborhoods into clusters with the *k*-means clustering algorithm.


<h2><font size = 4>1. Transforming the data in the table on the Wikipedia page into a pandas DataFrame</font></h2>

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; the pandas.io.json path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install beautifulsoup4
from bs4 import BeautifulSoup #the BeautifulSoup package, web scraping library

print('Libraries imported.')

Libraries imported.


In [2]:
# Using Beautiful Soup library to fetch data from Wikipedia page
# Load article, turn into soup and get the table.

import requests

website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"ff1dbaab-0815-40f9-9a0d-4f9528406a80","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":969510799,"wgRevisionId":969510799,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communicati

In [3]:
# Extracting table

My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

In [4]:
# Extracting columns of the table to lists

PostalCode=[]
Borough=[]
Neighbourhood=[]

for row in My_table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        PostalCode.append(str(cells[0].find(text=True)))
        Borough.append(str(cells[1].find(text=True)))
        Neighbourhood.append(str(cells[2].find(text=True)))
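
The same row-by-row extraction can be sketched with only the standard library. This is an illustrative alternative, not what the notebook runs: `TableRowParser` and `sample_html` are hypothetical names, and the sample table is a small hand-copied fragment of the Wikipedia markup shown above.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # strip the trailing newline that each Wikipedia cell carries
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if len(self._row) == 3:  # same guard as the loop above: skips the header row
                self.rows.append(self._row)
            self._row = None

# Hand-copied fragment of the table (assumption: the real page has the same shape)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Stripping whitespace at extraction time would make the later `str.replace('\n','')` clean-up unnecessary.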


In [5]:
# Create Dataframe with lists

df = pd.DataFrame()
df['PostalCode'] = PostalCode
df['Borough'] = Borough
df['Neighbourhood'] = Neighbourhood

print ('Dataframe size: ',df.shape, '\nDatatypes:',df.dtypes)
df.head()

Dataframe size:  (180, 3) 
Datatypes: PostalCode       object
Borough          object
Neighbourhood    object
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [6]:
# Stripping trailing newlines, then dropping rows where Borough is 'Not assigned'

df['PostalCode'] = df['PostalCode'].str.replace('\n','')
df['Borough'] = df['Borough'].str.replace('\n','')
df['Neighbourhood'] = df['Neighbourhood'].str.replace('\n','')

df = df[df.Borough != 'Not assigned']
df = df.dropna() # dropna() returns a new DataFrame, so the result must be assigned


df.reset_index(drop=True, inplace=True)

print ('Dataframe size: ',df.shape, '\nDatatypes:',df.dtypes)
df.head()

Dataframe size:  (103, 3) 
Datatypes: PostalCode       object
Borough          object
Neighbourhood    object
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
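# If a cell has a borough but a 'Not assigned' neighbourhood, use the borough name as the neighbourhood.
# The comparison string must match the table's spelling exactly ('Not assigned').

df['Neighbourhood'] = df['Neighbourhood'].where(df['Neighbourhood'] != 'Not assigned', df['Borough'])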

print ('Dataframe size: ',df.shape, '\nDatatypes:', df.dtypes)
df.head()

Dataframe size:  (103, 3) 
Datatypes: PostalCode       object
Borough          object
Neighbourhood    object
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<h2><font size = 4>2. Add Latitude & Longitude columns to DataFrame</font></h2>

### Geocoder attempt (abandoned: lookups took too long and returned no results)

#!pip install geocoder

#import geocoder # importing geocoder

latitude=[]
longitude=[]

for index, row in df.iterrows():
    
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
      g = geocoder.google('{}, Toronto, Ontario'.format(row['PostalCode']))
      lat_lng_coords = g.latlng

    latitude.append(lat_lng_coords[0]) 
    longitude.append(lat_lng_coords[1])

print (latitude[5], longitude[5])
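
The `while` loop above retries forever whenever the geocoder keeps returning `None`, which is consistent with the attempt never finishing. A minimal sketch of a bounded retry follows. `geocode_with_retries` and `flaky_lookup` are hypothetical helpers, and `flaky_lookup` only simulates a geocoding call; no real service is contacted.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return [43.65426, -79.360636] if calls['count'] >= 3 else None

result = geocode_with_retries(flaky_lookup, 'M5A, Toronto, Ontario')
always_none = geocode_with_retries(lambda q: None, 'M5A, Toronto, Ontario', max_attempts=3)
print(result, always_none)
```

With a cap on attempts, a postal code that never resolves yields `None` and can be handled explicitly, for example by falling back to the CSV file used in the next cell.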

In [8]:
# Reading coordinates from the link provided

df_coords = pd.read_csv('http://cocl.us/Geospatial_data')

df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Merging dataframes

df_new = pd.merge(df, df_coords, left_on='PostalCode', right_on='Postal Code', copy=True, indicator=False)
df_new = df_new.drop(['Postal Code'], axis=1)
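
`pd.merge` defaults to an inner join, so only postal codes present in both frames survive. A plain-Python sketch of that behaviour is shown below, using two coordinate rows from the output above. The `M0Z` row is a made-up entry added only to show an unmatched key being dropped.

```python
neighbourhoods = [
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods'},
    {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighbourhood': 'Victoria Village'},
    {'PostalCode': 'M0Z', 'Borough': 'Nowhere', 'Neighbourhood': 'Hypothetical'},  # made up: no coordinates exist
]
coords = [
    {'Postal Code': 'M3A', 'Latitude': 43.753259, 'Longitude': -79.329656},
    {'Postal Code': 'M4A', 'Latitude': 43.725882, 'Longitude': -79.315572},
]

# index the right-hand table by its key, as a hash join would
coords_by_code = {c['Postal Code']: c for c in coords}

merged = []
for row in neighbourhoods:
    match = coords_by_code.get(row['PostalCode'])
    if match is not None:  # inner join: rows without a match are dropped
        merged.append({**row,
                       'Latitude': match['Latitude'],
                       'Longitude': match['Longitude']})

for row in merged:
    print(row)
```

Comparing the row counts before and after the merge (103 in both cases here) is a quick way to confirm that no postal code was lost.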

In [10]:
print ('Dataframe size: ',df_new.shape, '\nDatatypes:', df_new.dtypes)
df_new.head()

Dataframe size:  (103, 5) 
Datatypes: PostalCode        object
Borough           object
Neighbourhood     object
Latitude         float64
Longitude        float64
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


<h2><font size = 4>3. Exploring and clustering the neighborhoods in Toronto</font></h2>

In [11]:
# Get geographical coordinates of Toronto, Canada

address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.


In [13]:
# importing Folium, a map rendering library

#!conda install -c conda-forge folium

import folium 

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_new['Latitude'], df_new['Longitude'], df_new['Borough'], df_new['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [15]:
# Keeping only the neighbourhoods whose borough has 'Toronto' in its name

df_toronto = df_new[df_new.Borough.str.contains('Toronto', case=False)]

print ('Dataframe size: ',df_toronto.shape, '\nDatatypes:', df_toronto.dtypes)
df_toronto.head()

Dataframe size:  (39, 5) 
Datatypes: PostalCode        object
Borough           object
Neighbourhood     object
Latitude         float64
Longitude        float64
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [16]:
# create map of Toronto using df_toronto DataFrame

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Foursquare API


In [17]:
# Foursquare credentials
# Read from environment variables so that secrets are not stored in the notebook;
# the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are this notebook's own choice.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Foursquare credentials loaded.')

Foursquare credentials loaded.


In [18]:
# Creating Foursquare API url to analyze JSON  

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            latitude, longitude, 
            VERSION, 
            'hospital', 
            500, 
            1)
    
# making a GET request to analyze JSON  
requests.get(url).json()

{'meta': {'code': 200, 'requestId': '5f2cbd2c29b3b45474c9a2ac'},
 'response': {'venues': [{'id': '4ad4c064f964a5206ef820e3',
    'name': 'The Hospital for Sick Children (SickKids)',
    'location': {'address': '555 University Ave.',
     'crossStreet': 'at Gerrard St.',
     'lat': 43.657498668962646,
     'lng': -79.3865121609307,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.657498668962646,
       'lng': -79.3865121609307}],
     'distance': 492,
     'postalCode': 'M5G 1X8',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['555 University Ave. (at Gerrard St.)',
      'Toronto ON M5G 1X8',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d196941735',
      'name': 'Hospital',
      'pluralName': 'Hospitals',
      'shortName': 'Hospital',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/medical_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-
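
Formatting the parameters straight into the URL string leaves the encoding of values such as `'Shopping Mall'` to the HTTP library. One way to make the encoding explicit is to build the query string with `urllib.parse.urlencode`. This is a sketch only: `build_search_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(client_id, client_secret, lat, lng, version, query, radius, limit):
    """Build a venues/search URL with percent-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials; the coordinates are the Toronto values printed earlier
url = build_search_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET',
                       43.6534817, -79.3839347,
                       '20180605', 'Shopping Mall', 1000, 50)
print(url)

# Decoding the query string recovers the original values
decoded = parse_qs(urlsplit(url).query)
print(decoded['query'], decoded['ll'])
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, which avoids building the string by hand.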

In [19]:
# Function to fetch nearby location data based on query from Foursquare API

def get_location_data(data, search_query, radius=500, LIMIT = 50):
    """Return a DataFrame of venues near each neighbourhood whose first category equals search_query."""

    columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
               'Place Name', 'Place Latitude', 'Place Longitude', 'Category']
    rows = []  # collected in a list; building the DataFrame once is cheaper than growing it row by row

    for index, row in data.iterrows():

        # Creating Foursquare API url
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                row['Latitude'], row['Longitude'], 
                VERSION, 
                search_query, 
                radius, 
                LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['venues']

        # keep only the venues whose first listed category matches the search query
        for v in results:

            categories_list = v['categories']
            if len(categories_list) == 0:
                category = None
            else:
                category = categories_list[0]['name']

            if category == search_query:
                rows.append([row['Neighbourhood'], row['Latitude'], row['Longitude'],
                             v['name'], v['location']['lat'], v['location']['lng'], category])

    nearby_places = pd.DataFrame(rows, columns=columns)
    print("Data Download Successful")
    return nearby_places
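
The filtering step inside the function can be tried offline on a hand-written response. In the sketch below, `extract_matching_venues` is a hypothetical helper; the first venue copies fields from the JSON output shown earlier, while the other two venues are invented to exercise the "no category" and "different category" branches.

```python
def extract_matching_venues(response_json, search_query):
    """Return (name, lat, lng, category) for venues whose first category matches."""
    matches = []
    for venue in response_json['response']['venues']:
        categories = venue['categories']
        category = categories[0]['name'] if categories else None
        if category == search_query:
            loc = venue['location']
            matches.append((venue['name'], loc['lat'], loc['lng'], category))
    return matches

sample_response = {
    'meta': {'code': 200},
    'response': {'venues': [
        # fields copied from the recorded API output above
        {'name': 'The Hospital for Sick Children (SickKids)',
         'location': {'lat': 43.657498668962646, 'lng': -79.3865121609307},
         'categories': [{'name': 'Hospital'}]},
        # invented venue with an empty category list
        {'name': 'Uncategorised Place',
         'location': {'lat': 43.65, 'lng': -79.38},
         'categories': []},
        # invented venue whose name matches the query but whose category does not
        {'name': 'Hospital Parking Garage',
         'location': {'lat': 43.656, 'lng': -79.387},
         'categories': [{'name': 'Parking'}]},
    ]},
}

matches = extract_matching_venues(sample_response, 'Hospital')
print(matches)
```

Because the comparison is an exact string match on the first category, venues that merely mention the query in their name are excluded.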

###  Fetch data for all parameters

In [20]:
# Fetching data on nearby hospitals

nearby_hospitals = get_location_data(df_toronto, 'Hospital')

print ('Dataframe size: ', nearby_hospitals.shape, '\nDatatypes:\n', nearby_hospitals.dtypes)
nearby_hospitals.head()

Data Download Successful
Dataframe size:  (78, 7) 
Datatypes:
 Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Place Name                 object
Place Latitude            float64
Place Longitude           float64
Category                   object
dtype: object


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Bay Cat Hospital,43.655393,-79.35854,Hospital
1,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Women's College Hospital,43.661491,-79.387602,Hospital
2,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Toronto General Hospital,43.658762,-79.388292,Hospital
3,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Mount Sinai Hospital Women's and Infants' Depa...,43.659612,-79.390761,Hospital
4,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,"Mount Sinai Hospital, Joseph and Wolf Lebovic ...",43.658247,-79.391473,Hospital
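
The `radius` argument is a distance in metres around each neighbourhood's coordinates. As a sanity check on one row of the table above, the great-circle distance between "Regent Park, Harbourfront" and Bay Cat Hospital can be estimated with the haversine formula. `haversine_m` is a hypothetical helper, and the mean Earth radius of 6,371 km is an approximation.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Neighbourhood centre and venue coordinates taken from the first table row
distance = haversine_m(43.65426, -79.360636, 43.655393, -79.35854)
print(round(distance), 'm')
```

The result is a little over 200 m, comfortably inside the default 500 m search radius.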


In [21]:
# Fetching data on nearby schools

nearby_schools = get_location_data(df_toronto, 'School')

print ('Dataframe size: ', nearby_schools.shape, '\nDatatypes:\n', nearby_schools.dtypes)
nearby_schools.head()

Data Download Successful
Dataframe size:  (45, 7) 
Datatypes:
 Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Place Name                 object
Place Latitude            float64
Place Longitude           float64
Category                   object
dtype: object


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Orde Street Junior Public School,43.658435,-79.392295,School
1,"Garden District, Ryerson",43.657162,-79.378937,Ryerson Daphne Cockwell School,43.65701,-79.37765,School
2,The Beaches,43.676357,-79.293031,St.John Catholic School,43.680676,-79.294542,School
3,The Beaches,43.676357,-79.293031,Balmy Beach School,43.676199,-79.290134,School
4,The Beaches,43.676357,-79.293031,St. Denis Catholic School,43.672881,-79.290056,School


In [22]:
# Fetching data on nearby parks

nearby_parks = get_location_data(df_toronto, 'Park')

print ('Dataframe size: ', nearby_parks.shape, '\nDatatypes:\n', nearby_parks.dtypes)
nearby_parks.head()

Data Download Successful
Dataframe size:  (160, 7) 
Datatypes:
 Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Place Name                 object
Place Latitude            float64
Place Longitude           float64
Category                   object
dtype: object


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Parliament Square Park,43.650264,-79.362195,Park
1,"Regent Park, Harbourfront",43.65426,-79.360636,Underpass Park,43.655764,-79.354806,Park
2,"Regent Park, Harbourfront",43.65426,-79.360636,Percy Park,43.65518,-79.357421,Park
3,"Regent Park, Harbourfront",43.65426,-79.360636,Taddle Creek Parkette,43.653217,-79.363934,Park
4,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Queen's Park,43.663946,-79.39218,Park


In [23]:
# Fetching data on nearby malls (using a larger radius of 1000)

nearby_malls = get_location_data(df_toronto, 'Shopping Mall', 1000)

print ('Dataframe size: ', nearby_malls.shape, '\nDatatypes:\n', nearby_malls.dtypes)
nearby_malls.head()

Data Download Successful
Dataframe size:  (31, 7) 
Datatypes:
 Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Place Name                 object
Place Latitude            float64
Place Longitude           float64
Category                   object
dtype: object


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,CF Toronto Eaton Centre,43.65454,-79.380677,Shopping Mall
1,"Garden District, Ryerson",43.657162,-79.378937,CF Toronto Eaton Centre,43.65454,-79.380677,Shopping Mall
2,"Garden District, Ryerson",43.657162,-79.378937,TD Centre Shopping Concourse,43.647184,-79.380932,Shopping Mall
3,St. James Town,43.651494,-79.375418,CF Toronto Eaton Centre,43.65454,-79.380677,Shopping Mall
4,St. James Town,43.651494,-79.375418,TD Centre Shopping Concourse,43.647184,-79.380932,Shopping Mall


In [24]:
# Fetching data on nearby ATMs

nearby_ATM = get_location_data(df_toronto, 'ATM')

print ('Dataframe size: ', nearby_ATM.shape, '\nDatatypes:\n', nearby_ATM.dtypes)
nearby_ATM.head()

Data Download Successful
Dataframe size:  (29, 7) 
Datatypes:
 Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Place Name                 object
Place Latitude            float64
Place Longitude           float64
Category                   object
dtype: object


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,President's Choice Financial ATM,43.655461,-79.364049,ATM
1,"Regent Park, Harbourfront",43.65426,-79.360636,President's Choice Financial ATM,43.651418,-79.365947,ATM
2,"Garden District, Ryerson",43.657162,-79.378937,President's Choice Financial ATM,43.654064,-79.380696,ATM
3,"Garden District, Ryerson",43.657162,-79.378937,President's Choice Financial ATM,43.661822,-79.383028,ATM
4,"Garden District, Ryerson",43.657162,-79.378937,BMO ATM,43.658378,-79.377554,ATM


In [25]:
# Concatenating all DataFrames

neighborhood = pd.concat([nearby_hospitals, nearby_schools, nearby_parks, nearby_malls, nearby_ATM])

print ('Dataframe size: ', neighborhood.shape)
neighborhood.head()

Dataframe size:  (343, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Bay Cat Hospital,43.655393,-79.35854,Hospital
1,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Women's College Hospital,43.661491,-79.387602,Hospital
2,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Toronto General Hospital,43.658762,-79.388292,Hospital
3,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Mount Sinai Hospital Women's and Infants' Depa...,43.659612,-79.390761,Hospital
4,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,"Mount Sinai Hospital, Joseph and Wolf Lebovic ...",43.658247,-79.391473,Hospital


### Exploring Data

In [26]:
#Checking number of places in the neighbourhoods

neighborhood.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Place Name,Place Latitude,Place Longitude,Category
Berczy Park,6,6,6,6,6,6
"Brockton, Parkdale Village, Exhibition Place",8,8,8,8,8,8
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",1,1,1,1,1,1
Central Bay Street,31,31,31,31,31,31
Christie,3,3,3,3,3,3
Church and Wellesley,15,15,15,15,15,15
"Commerce Court, Victoria Hotel",18,18,18,18,18,18
Davisville,5,5,5,5,5,5
Davisville North,1,1,1,1,1,1
"Dufferin, Dovercourt Village",8,8,8,8,8,8


In [29]:
# one hot encoding
neighborhood_onehot = pd.get_dummies(neighborhood[['Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neighborhood_onehot['Neighborhood'] = neighborhood['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [neighborhood_onehot.columns[-1]] + list(neighborhood_onehot.columns[:-1])
neighborhood_onehot = neighborhood_onehot[fixed_columns]

neighborhood_onehot.head()

Unnamed: 0,Neighborhood,ATM,Hospital,Park,School,Shopping Mall
0,"Regent Park, Harbourfront",0,1,0,0,0
1,"Queen's Park, Ontario Provincial Government",0,1,0,0,0
2,"Queen's Park, Ontario Provincial Government",0,1,0,0,0
3,"Queen's Park, Ontario Provincial Government",0,1,0,0,0
4,"Queen's Park, Ontario Provincial Government",0,1,0,0,0


In [40]:
# Grouping by neighbourhood and taking the mean of each one-hot column, i.e. the share of each category among the places found there

neighborhood_grouped = neighborhood_onehot.groupby('Neighborhood').mean().reset_index()
neighborhood_grouped

Unnamed: 0,Neighborhood,ATM,Hospital,Park,School,Shopping Mall
0,Berczy Park,0.0,0.0,0.833333,0.0,0.166667
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.625,0.125,0.25
2,"Business reply mail Processing Centre, South C...",0.0,0.0,1.0,0.0,0.0
3,Central Bay Street,0.096774,0.709677,0.096774,0.032258,0.064516
4,Christie,0.0,0.0,0.666667,0.333333,0.0
5,Church and Wellesley,0.066667,0.133333,0.666667,0.133333,0.0
6,"Commerce Court, Victoria Hotel",0.111111,0.166667,0.5,0.111111,0.111111
7,Davisville,0.2,0.0,0.6,0.2,0.0
8,Davisville North,0.0,0.0,1.0,0.0,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.75,0.125,0.125
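
Taking the mean of one-hot columns is the same as computing, for each neighbourhood, the fraction of its places that fall in each category. A plain-Python sketch follows. The place counts are inferred from the tables above (Berczy Park: 6 places with shares 0.833333 and 0.166667, hence 5 parks and 1 shopping mall; Christie: 3 places, hence 2 parks and 1 school), not read from the raw data.

```python
from collections import Counter, defaultdict

# (neighbourhood, category) pairs; counts inferred from the tables above
places = (
    [('Berczy Park', 'Park')] * 5
    + [('Berczy Park', 'Shopping Mall')]
    + [('Christie', 'Park')] * 2
    + [('Christie', 'School')]
)

# count categories per neighbourhood
counts = defaultdict(Counter)
for neighbourhood, category in places:
    counts[neighbourhood][category] += 1

# divide each count by the neighbourhood's total to get the share
shares = {
    neighbourhood: {category: n / sum(counter.values())
                    for category, n in counter.items()}
    for neighbourhood, counter in counts.items()
}

for neighbourhood, row in shares.items():
    print(neighbourhood, {c: round(s, 6) for c, s in row.items()})
```

Each row sums to 1, so neighbourhoods with very few places (for example a single park) end up with an extreme profile; this is worth keeping in mind when reading the clusters.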


## Clustering Neighborhoods

In [41]:
# setting number of clusters
kclusters = 5

neighborhood_grouped_clustering = neighborhood_grouped.drop(columns='Neighborhood') # the positional axis argument is no longer accepted by recent pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neighborhood_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 4, 2, 0, 0, 0, 0, 4, 0], dtype=int32)
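
For intuition, the two steps that *k*-means alternates between (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain Python. This is not scikit-learn's implementation, which also chooses starting centroids more carefully and may produce different labels. The function `kmeans`, the choice of three clusters, and the hand-picked starting centroids are assumptions made for illustration. The six feature rows are copied from the tables in this notebook.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with fixed starting centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Columns: ATM, Hospital, Park, School, Shopping Mall
names = ['Berczy Park', 'Davisville North', 'Central Bay Street',
         "Queen's Park, Ontario Provincial Government", 'The Beaches', 'Roselawn']
points = [
    [0.0, 0.0, 0.833333, 0.0, 0.166667],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.096774, 0.709677, 0.096774, 0.032258, 0.064516],
    [0.0, 0.758621, 0.172414, 0.034483, 0.034483],
    [0.0, 0.0, 0.375, 0.5, 0.125],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]

# Hand-picked starting centroids: one park-heavy, one hospital-heavy, one school-heavy row
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
for name, label in zip(names, labels):
    print(label, name)
```

On this small sample the park-dominated, hospital-dominated and school-dominated rows separate into three groups, which mirrors the structure of the clusters examined later.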

In [42]:
# add clustering labels

neighborhood_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhood_grouped.head()

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Hospital,Park,School,Shopping Mall
0,4,Berczy Park,0.0,0.0,0.833333,0.0,0.166667
1,0,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.625,0.125,0.25
2,4,"Business reply mail Processing Centre, South C...",0.0,0.0,1.0,0.0,0.0
3,2,Central Bay Street,0.096774,0.709677,0.096774,0.032258,0.064516
4,0,Christie,0.0,0.0,0.666667,0.333333,0.0


In [43]:
# Merging with main Data Set

df_toronto_merged = df_toronto

# joining neighborhood_grouped onto df_toronto so that each neighbourhood's coordinates carry its cluster label and category shares
df_toronto_merged = df_toronto_merged.join(neighborhood_grouped.set_index('Neighborhood'), on='Neighbourhood')

df_toronto_merged.dropna(inplace=True)
df_toronto_merged['Cluster Labels'] = df_toronto_merged['Cluster Labels'].astype(int)

print ('Dataframe size: ', df_toronto_merged.shape, '\nDatatypes:\n', df_toronto_merged.dtypes)
df_toronto_merged # check the last columns!

Dataframe size:  (38, 11) 
Datatypes:
 PostalCode         object
Borough            object
Neighbourhood      object
Latitude          float64
Longitude         float64
Cluster Labels      int64
ATM               float64
Hospital          float64
Park              float64
School            float64
Shopping Mall     float64
dtype: object


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,0.285714,0.142857,0.571429,0.0,0.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,0.0,0.758621,0.172414,0.034483,0.034483
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,0.238095,0.47619,0.142857,0.047619,0.095238
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,0.055556,0.5,0.333333,0.0,0.111111
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,0.0,0.0,0.375,0.5,0.125
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,4,0.0,0.0,0.833333,0.0,0.166667
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,2,0.096774,0.709677,0.096774,0.032258,0.064516
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,0.0,0.0,0.666667,0.333333,0.0
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,0,0.214286,0.142857,0.428571,0.071429,0.142857
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,0,0.0,0.0,0.75,0.125,0.125


### Visualizing Clusters

In [44]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto_merged['Latitude'], df_toronto_merged['Longitude'], df_toronto_merged['Neighbourhood'], df_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examining Clusters

Cluster 1

In [53]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 0, df_toronto_merged.columns[[2] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
2,"Regent Park, Harbourfront",0,0.285714,0.142857,0.571429,0.0,0.0
25,Christie,0,0.0,0.0,0.666667,0.333333,0.0
30,"Richmond, Adelaide, King",0,0.214286,0.142857,0.428571,0.071429,0.142857
31,"Dufferin, Dovercourt Village",0,0.0,0.0,0.75,0.125,0.125
36,"Harbourfront East, Union Station, Toronto Islands",0,0.083333,0.0,0.75,0.083333,0.083333
41,"The Danforth West, Riverdale",0,0.0,0.0,0.6,0.2,0.2
42,"Toronto Dominion Centre, Design Exchange",0,0.2,0.0,0.533333,0.133333,0.133333
43,"Brockton, Parkdale Village, Exhibition Place",0,0.0,0.0,0.625,0.125,0.25
48,"Commerce Court, Victoria Hotel",0,0.111111,0.166667,0.5,0.111111,0.111111
54,Studio District,0,0.125,0.0,0.625,0.125,0.125


Cluster 2

In [54]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 1, df_toronto_merged.columns[[2] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
19,The Beaches,1,0.0,0.0,0.375,0.5,0.125
37,"Little Portugal, Trinity",1,0.0,0.0,0.4,0.4,0.2
62,Roselawn,1,0.0,0.0,0.0,0.5,0.5
68,"Forest Hill North & West, Forest Hill Road Park",1,0.0,0.0,0.4,0.4,0.2
81,"Runnymede, Swansea",1,0.142857,0.0,0.428571,0.428571,0.0
86,"Summerhill West, Rathnelly, South Hill, Forest...",1,0.0,0.0,0.25,0.75,0.0


Cluster 3

In [55]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 2, df_toronto_merged.columns[[2] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
4,"Queen's Park, Ontario Provincial Government",2,0.0,0.758621,0.172414,0.034483,0.034483
9,"Garden District, Ryerson",2,0.238095,0.47619,0.142857,0.047619,0.095238
15,St. James Town,2,0.055556,0.5,0.333333,0.0,0.111111
24,Central Bay Street,2,0.096774,0.709677,0.096774,0.032258,0.064516


Cluster 4

In [56]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 3, df_toronto_merged.columns[[2] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
75,"Parkdale, Roncesvalles",3,0.0,0.0,0.0,0.0,1.0


Cluster 5

In [57]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 4, df_toronto_merged.columns[[2] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,ATM,Hospital,Park,School,Shopping Mall
20,Berczy Park,4,0.0,0.0,0.833333,0.0,0.166667
47,"India Bazaar, The Beaches West",4,0.0,0.0,1.0,0.0,0.0
67,Davisville North,4,0.0,0.0,1.0,0.0,0.0
73,"North Toronto West, Lawrence Park",4,0.0,0.0,1.0,0.0,0.0
91,Rosedale,4,0.0,0.0,1.0,0.0,0.0
92,Stn A PO Boxes,4,0.0,0.0,0.833333,0.0,0.166667
100,"Business reply mail Processing Centre, South C...",4,0.0,0.0,1.0,0.0,0.0
